Abstract

We study and compare three estimators of a discrete monotone distribution: (a)
the (raw) empirical estimator; (b) the "method of rearrangements" estimator; and (c)
the maximum likelihood estimator. We show that the maximum likelihood estimator
strictly dominates both the rearrangement and empirical estimators in cases when
the distribution has intervals of constancy. For example, when the distribution is
uniform on $\{0, \ldots, y\}$, the asymptotic risk of the method of rearrangements estimator
(in squared $\ell_2$ norm) is $y/(y+1)$, while the asymptotic risk of the MLE is of order
$(\log y)/(y+1)$. For strictly decreasing distributions, the estimators are asymptotically
equivalent.



1 Introduction 

This paper is motivated in large part by the recent surge of activity concerning "method
of rearrangement" estimators for nonparametric estimation of monotone functions: see, for
example, Fougères (1997), Dette and Pilz (2006), Dette et al. (2006), Chernozhukov et al.
(2009) and Anevski and Fougères (2007). Most of these authors study continuous settings
and often start with a kernel type estimator of the density, which involves choices of a kernel
and of a bandwidth. Our goal here is to investigate method of rearrangement estimators and
compare them to natural alternatives (including the maximum likelihood estimators with
and without the assumption of monotonicity) in a setting in which there is less ambiguity in
the choice of an initial or "basic" estimator, namely the setting of estimation of a monotone
decreasing mass function on the non-negative integers $\mathbb{N} = \{0, 1, 2, \ldots\}$.

Suppose that $p = \{p_x\}_{x \in \mathbb{N}}$ is a probability mass function; i.e. $p_x \ge 0$ for all
$x \in \mathbb{N}$ and $\sum_{x \in \mathbb{N}} p_x = 1$. Our primary interest here is in the situation in which $p$ is
monotone decreasing: $p_x \ge p_{x+1}$ for all $x \in \mathbb{N}$. The three estimators of $p$ we study are:

(a) $\hat p_n$, the (raw) empirical estimator,

(b) $\hat p_n^R$, the method of rearrangement estimator,

(c) $\hat p_n^G$, the maximum likelihood estimator.
Notice that the empirical estimator is also the maximum likelihood estimator when no shape
assumption is made on the true probability mass function.

Much as in the continuous case, our considerations here carry over to the case of estimation
[...] (1987), and Alamatsaz (1993). [...] trade-offs between discrete and continuous models in a
related problem involving nonparametric estimation of a monotone function, see Banerjee et al.
(2009) and Maathuis and Hudgens [...].

Distributions from the monotone decreasing family satisfy $\Delta p_x \equiv p_{x+1} - p_x \le 0$ for all
$x \in \mathbb{N}$, and may be written as mixtures of uniform mass functions

$$ p_x = \sum_{y \ge x} \frac{1}{y+1}\, q_y. \qquad (1.1) $$

Here, the mixing distribution $q$ may be recovered via

$$ q_x = -(x+1)\, \Delta p_x, \qquad (1.2) $$

for any $x \in \mathbb{N}$.

Remark 1.1. From the form of the mass function, it follows that $p_x \le 1/(x+1)$ for
all $x \ge 0$.
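
The correspondence between a decreasing mass function and its mixing distribution can be checked numerically. The following is a minimal sketch, assuming finite support and the convention that $p$ vanishes beyond the last listed entry; the function names `pmf_from_mixing` and `mixing_from_pmf` are illustrative and do not come from the paper.

```python
from fractions import Fraction


def pmf_from_mixing(q):
    """Return p with p_x = sum_{y >= x} q_y / (y + 1), for x = 0..len(q)-1."""
    p = [Fraction(0)] * len(q)
    tail = Fraction(0)
    # Accumulate the tail sum from the largest index downwards.
    for y in range(len(q) - 1, -1, -1):
        tail += Fraction(q[y]) / (y + 1)
        p[y] = tail
    return p


def mixing_from_pmf(p):
    """Return q with q_x = -(x + 1) * (p_{x+1} - p_x), taking p to be 0 past its end."""
    padded = list(p) + [0]
    return [-(x + 1) * (padded[x + 1] - padded[x]) for x in range(len(p))]
```

For instance, putting mixing weight 1/5 on the uniform on $\{0, \ldots, 3\}$ and 4/5 on the uniform on $\{0, \ldots, 7\}$ gives $p_x = 3/20$ for $x \le 3$ and $p_x = 1/10$ for $4 \le x \le 7$, and applying the second function to that $p$ returns the original weights.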

Suppose then that we observe $X_1, X_2, \ldots, X_n$ i.i.d. random variables with values in $\mathbb{N}$
and with a monotone decreasing mass function $p$. For $x \in \mathbb{N}$, let

$$ \hat p_{n,x} \equiv n^{-1} \sum_{i=1}^{n} 1_{\{x\}}(X_i) $$

denote the (unconstrained) empirical estimator of the probabilities $p_x$. Clearly, there is no
guarantee that this estimator will also be monotone decreasing, especially for small sample
sizes. We next consider two estimators which do satisfy this property: the rearrangement
estimator and the maximum likelihood estimator (MLE).

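For a vector $w = \{w_0, \ldots, w_k\}$, let $\mathrm{rear}(w)$ denote the reverse-ordered vector such that
$w' = \mathrm{rear}(w)$ satisfies $w'_0 \ge w'_1 \ge \ldots \ge w'_k$. The rearrangement estimator is then simply
defined as

$$ \hat p_n^R = \mathrm{rear}(\hat p_n). $$

We can also write $\hat p_{n,x}^R = \sup\{u : Q_n(u) > x\}$, where $Q_n(u) = \#\{x : \hat p_{n,x} \ge u\}$ counts the
empirical proportions that are at least $u$.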
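
The two definitions above translate directly into code. This is a minimal sketch, assuming the sample is a non-empty list of non-negative integers; the names `empirical_pmf` and `rear` are chosen here for illustration.

```python
from collections import Counter


def empirical_pmf(sample):
    """Proportion of observations equal to x, for x = 0..max(sample)."""
    counts = Counter(sample)
    n = len(sample)
    return [counts.get(x, 0) / n for x in range(max(sample) + 1)]


def rear(w):
    """Entries of w sorted into decreasing order."""
    return sorted(w, reverse=True)
```

Applied to the empirical vector of Example 1 below, `rear` simply returns the same six proportions sorted from largest to smallest.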

To define the MLE we again need some additional notation. For a vector $w = \{w_0, \ldots, w_k\}$,
let $\mathrm{gren}(w)$ be the operator which returns the vector of the $k+1$ slopes of the least concave
majorant of the points

$$ \Big\{ \Big(j, \sum_{i=0}^{j} w_i\Big) : j = -1, 0, \ldots, k \Big\}. $$

Here, we assume that $\sum_{i=0}^{-1} w_i = 0$. The MLE, also known as the Grenander estimator, is
then defined as

$$ \hat p_n^G = \mathrm{gren}(\hat p_n). $$

Thus, $\hat p_{n,x}^G$ is the left derivative at $x$ of the least concave majorant (LCM) of the empirical
distribution function $F_n(x) = n^{-1} \sum_{i=1}^{n} 1_{[0,x]}(X_i)$ (where we include the point $(-1, 0)$ to find
the left derivative at $x = 0$). Therefore, by definition, the MLE is a vector of local averages
over a partition of $\{0, \ldots, \max\{X_1, \ldots, X_n\}\}$. This partition is chosen by the touchpoints
of the LCM with $F_n$. It is easily checked that $\hat p_n^G$ corresponds to the isotonic estimator for
multinomial data as described in Robertson et al. (1988), pages 7-8 and 38-39.
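
One way to compute the slopes of the least concave majorant is the pool-adjacent-violators scheme, which repeatedly averages neighbouring blocks until the block averages are decreasing. The sketch below uses that scheme as a stand-in; it is not the R code mentioned by the authors, and the name `gren` is reused only to mirror the notation.

```python
def gren(w):
    """Left slopes of the least concave majorant of the partial sums of w.

    Computed by pooling adjacent violators: neighbouring blocks are merged
    and replaced by their average until block averages are decreasing.
    """
    blocks = []  # each block is [sum of entries, number of entries]
    for value in w:
        blocks.append([value, 1])
        # Merge while the previous block's average is below the last block's.
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]
        ):
            total, size = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += size
    slopes = []
    for total, size in blocks:
        slopes.extend([total / size] * size)
    return slopes
```

Run on the counts (20, 14, 11, 22, 15, 18) behind Example 1 below, the last five cells are pooled into one block with average 16, which matches the MLE reported there after dividing by $n = 100$.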



We begin our discussion with two examples: in the first, $p$ is the uniform distribution,
and in the second $p$ is strictly monotone decreasing. To compare the three estimators, we
consider several metrics: the $\ell_k$ norm for $1 \le k \le \infty$ and the Hellinger distance. Recall that
the Hellinger distance between two mass functions is given by

$$ H^2(\tilde p, p) = \frac{1}{2} \sum_{x \ge 0} \big[ \sqrt{\tilde p_x} - \sqrt{p_x} \,\big]^2, $$

while the $\ell_k$ metrics are defined as

$$ \|\tilde p - p\|_k = \begin{cases} \big( \sum_{x \ge 0} |\tilde p_x - p_x|^k \big)^{1/k}, & 1 \le k < \infty, \\ \sup_{x \ge 0} |\tilde p_x - p_x|, & k = \infty. \end{cases} $$

In the examples, we compare the Hellinger distance and the $\ell_1$ and $\ell_2$ metrics, as the behaviour
of these differs the most.
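
The metrics can be evaluated with a few lines of code. This is a small sketch, assuming both mass functions are given as equal-length lists (padded with zeros where needed); the names `hellinger` and `lk_distance` are illustrative.

```python
import math


def hellinger(p_est, p):
    """H(p_est, p) with H^2 = (1/2) * sum_x (sqrt(p_est_x) - sqrt(p_x))^2."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p_est, p)))


def lk_distance(p_est, p, k):
    """l_k distance for 1 <= k < infinity; pass k = math.inf for the sup norm."""
    gaps = [abs(a - b) for a, b in zip(p_est, p)]
    if k == math.inf:
        return max(gaps)
    return sum(g ** k for g in gaps) ** (1.0 / k)
```

With the uniform distribution on $\{0, \ldots, 5\}$ as the truth, these functions reproduce the Example 1 entries of Table 1 to the reported precision for both the empirical vector and the MLE.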

Example 1. Suppose that $p$ is the uniform distribution on $\{0, \ldots, 5\}$. For $n = 100$
independent draws from this distribution we observe $\hat p_n = (0.20, 0.14, 0.11, 0.22, 0.15, 0.18)$.
Then $\hat p_n^R = (0.22, 0.20, 0.18, 0.15, 0.14, 0.11)$, and the MLE may be calculated as $\hat p_n^G =
(0.20, 0.16, 0.16, 0.16, 0.16, 0.16)$. The estimators are illustrated in Figure 1 (left). The
distances of the estimators from the true mass function $p$ are given in Table 1 (left). The
maximum likelihood estimator $\hat p_n^G$ is superior in all three metrics shown. To explore this
relationship further, we repeated the estimation procedure for 1000 Monte Carlo samples of
size $n = 100$ from the uniform distribution. Figure 2 (left) shows boxplots of the metrics for
the three estimators. The figure shows that here the rearrangement and empirical estimators
have the same behaviour; a relationship which we establish rigorously in Theorem 2.1.

Example 2. Suppose that $p$ is the geometric distribution with $p_x = (1 - \theta)\theta^x$ for $x \in \mathbb{N}$
and with $\theta = 0.75$. For $n = 100$ draws from this distribution we observe $\hat p_n$, $\hat p_n^R$ and $\hat p_n^G$
as shown in Figure 1 (right). The distances of the estimators from the true mass function
$p$ are given in Table 1 (right). Here, $\hat p_n$ is outperformed by $\hat p_n^G$ and $\hat p_n^R$ in all the metrics,
with $\hat p_n^R$ performing better in the $\ell_1$ and $\ell_2$ metrics, but not in the Hellinger distance. These
relationships appear to hold true in general; see Figure 2 (right) for boxplots of the metrics
obtained through Monte Carlo simulation.
Figure 1: Illustration of MLE and monotone rearrangement estimators: empirical proportions
(black dots), monotone rearrangement estimator (dashed line), MLE (solid line), and
the true mass function (grey line). Left: the true distribution is the discrete uniform; and
right: the true distribution is the geometric distribution with $\theta = 0.75$. In both cases a
sample size of $n = 100$ was observed.







Table 1: Distances between true $p$ and estimators $\tilde p$. Here H denotes $H(\tilde p, p)$, l_2 denotes
$\|\tilde p - p\|_2$ and l_1 denotes $\|\tilde p - p\|_1$.

                                   Example 1                        Example 2
                            H         l_2       l_1          H         l_2       l_1
$\tilde p = \hat p_n$      0.08043   0.09129   0.2          0.1641    0.07425   0.2299
$\tilde p = \hat p_n^R$    0.08043   0.09129   0.2          0.1290    0.06115   0.1821
$\tilde p = \hat p_n^G$    0.03048   0.03651   0.06667      0.09553   0.06302   0.1887



The above examples illustrate our main conclusion: the MLE performs better when the
true distribution $p$ has intervals of constancy, while the MLE and rearrangement estimators
are competitive when $p$ is strictly monotone. Asymptotically, it turns out that the MLE is
superior if $p$ has any periods of constancy, while the empirical and rearrangement estimators
are equivalent. However, if $p$ is strictly monotone, then all three estimators have the same
asymptotic behaviour.

Both the MLE and monotone rearrangement estimators have been considered in the
literature for the decreasing probability density function. The MLE, or Grenander estimator,
has been studied extensively, and much is known about its behaviour. In particular, if the
true density is locally strictly decreasing, then the estimator converges at a rate of $n^{1/3}$,
and if the true density is locally flat, then the estimator converges at a rate of $n^{1/2}$, cf.
Prakasa Rao (1969), Carolan and Dykstra (1999), and the references therein for a further
history of the problem. In both cases the limiting distribution is characterized via the LCM
of a Gaussian process.

The monotone rearrangement estimator for the continuous density was introduced by
Fougères (1997) (see also Dette and Pilz (2006)). It is found by calculating the monotone
rearrangement of a kernel density estimator (see e.g. Lieb and Loss (1997)). Fougères (1997)
shows that this estimator also converges at the $n^{1/3}$ rate if the true density is locally strictly
decreasing, and it is shown through Monte Carlo simulations that it has better behaviour
than the MLE for small sample size. The latter is done by comparing the $L_1$ metrics for
different, strictly decreasing, densities. Unlike our Example 2, the Hellinger distance is not
considered.

Figure 2: Monte Carlo comparison of the estimators: boxplots of $m = 1000$ distances of the
estimators $\hat p_n$ (white), $\hat p_n^R$ (light grey) and $\hat p_n^G$ (dark grey) from the truth for a sample size of
$n = 100$. Left: the true distribution is the discrete uniform; and right: the true distribution
is the geometric distribution with $\theta = 0.75$.

The outline of this paper is as follows. In Section 2 we show that all three estimators are
consistent. We also establish some small sample size relationships between the estimators.
Section 3 is dedicated to the limiting distributions of the estimators, where we show that
the rate of convergence is $n^{1/2}$ for all three estimators. Unlike the continuous case, the local
behaviour of the MLE is equivalent to that of the empirical estimator when the true mass
function is strictly decreasing. In Section 4 we consider the limiting behaviour of the $\ell_p$ and
Hellinger distances of the estimators. In Section 5 we consider the estimation of the mixing
distribution $q$. Proofs and some technical results are given in Section 6. R code to calculate
the maximum likelihood estimator (i.e. $\mathrm{gren}(\hat p_n)$) is available from the website of the first
author: (will be provided).



2 Some inequalities and consistency results 

We begin by establishing several relationships between the three different estimators.

Theorem 2.1. (i). Suppose that $p$ is monotone decreasing. Then

$$ \max\{ H(\hat p_n^R, p), H(\hat p_n^G, p) \} \le H(\hat p_n, p), \qquad (2.3) $$

$$ \max\{ \|\hat p_n^R - p\|_k, \|\hat p_n^G - p\|_k \} \le \|\hat p_n - p\|_k, \quad 1 \le k \le \infty. \qquad (2.4) $$

(ii). If $p$ is the uniform distribution on $\{0, \ldots, y\}$ for some integer $y$, then

$$ H(\hat p_n, p) = H(\hat p_n^R, p), $$

$$ \|\hat p_n - p\|_k = \|\hat p_n^R - p\|_k, \quad 1 \le k \le \infty. $$

(iii). If $\hat p_n$ is monotone then $\hat p_n^G = \hat p_n^R = \hat p_n$. Under the discrete uniform distribution on
$\{0, \ldots, y\}$, this occurs with probability

$$ P(\hat p_{n,0} \ge \hat p_{n,1} \ge \cdots \ge \hat p_{n,y}) \to \frac{1}{(y+1)!}, \quad \text{as } n \to \infty. $$

If $p$ is strictly monotone with the support of $p$ equal to $\{0, \ldots, y\}$ where $y \in \mathbb{N}$, then

$$ P(\hat p_{n,0} \ge \hat p_{n,1} \ge \cdots \ge \hat p_{n,y}) \to 1, $$

as $n \to \infty$.

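Let $\mathcal{P}$ denote the collection of all decreasing mass functions on $\mathbb{N}$. For any estimator $\tilde p_n$
of $p \in \mathcal{P}$ and $k \ge 1$ let the loss function be defined by $L_k(p, \tilde p_n) = \sum_{x \ge 0} |\tilde p_{n,x} - p_x|^k$, with
$L_\infty(p, \tilde p_n) = \sup_{x \ge 0} |\tilde p_{n,x} - p_x|$. The risk of $\tilde p_n$ at $p$ is then defined as

$$ R_k(p, \tilde p_n) = E_p \Big[ \sum_{x \ge 0} |\tilde p_{n,x} - p_x|^k \Big]. \qquad (2.5) $$

Corollary 2.2. When $k = 2$, and for any sample size $n$, it holds that

$$ \sup_{\mathcal{P}} R_2(p, \hat p_n^G) \le \sup_{\mathcal{P}} R_2(p, \hat p_n^R) = \sup_{\mathcal{P}} R_2(p, \hat p_n). $$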

Based on these results, we now make the following remarks. 

1. It is always better to use a monotone estimator (either $\hat p_n^R$ or $\hat p_n^G$) to estimate a monotone
mass function.

2. If the true distribution is uniform, then clearly the MLE is the better choice.

3. If the true mass function is strictly monotone, then the estimators $\hat p_n^R$ and $\hat p_n^G$ should be
asymptotically equivalent. We make this statement more precise in Sections 3 and 4.
Figure 2 (right) shows that in this case $\hat p_n^R$ and $\hat p_n^G$ have about the same performance
for $n = 100$.

4. When only the monotonicity constraint is known about the true $p$, then, by
Corollary 2.2, $\hat p_n^G$ is a better choice of estimator than $\hat p_n^R$.

Remark 2.3. In continuous density estimation one of the most popular measures of
distance is the $L_1$ norm, which corresponds to the $\ell_1$ norm on mass functions. However, for
discrete mass functions, it is more natural to consider the $\ell_2$ norm. One of the reasons is made
clear in the following sections (cf. Theorem 3.8, Corollaries 4.1 and 4.2, and Remark 4.4).
The $\ell_2$ space is the smallest space in which we obtain convergence results, without additional
assumptions on the true distribution $p$.
Figure 3: Comparison of the estimators $\hat p_n$ (white), $\hat p_n^R$ (light grey) and $\hat p_n^G$ (dark grey).



To examine more closely the case when the true distribution $p$ is neither uniform nor
strictly monotone we turn to Monte Carlo simulations. Let $p^{U(y)}$ denote the uniform mass
function on $\{0, \ldots, y\}$. Figure 3 shows boxplots of $m = 1000$ samples of the estimators for
three distributions:

(a) (top) $p = 0.2\,p^{U(3)} + 0.8\,p^{U(7)}$

(b) (centre) $p = 0.15\,p^{U(\cdot)} + 0.1\,p^{U(\cdot)} + 0.75\,p^{U(\cdot)}$

(c) (bottom) $p = 0.25\,p^{U(1)} + 0.2\,p^{U(3)} + 0.15\,p^{U(5)} + 0.4\,p^{U(\cdot)}$

On the left we have a small sample size of $n = 20$, while on the right $n = 100$. For each
distribution and sample size, we calculate the three estimators (the estimators $\hat p_n$, $\hat p_n^R$ and
$\hat p_n^G$ are shown in white, light grey and dark grey, respectively) and compute their distance
functions from the truth (Hellinger, $\ell_1$, and $\ell_2$). Note that the MLE outperforms the other
estimators in all three metrics, even for small sample sizes. It appears also that the more
regions of constancy the true mass function has, the better the relative performance of the
MLE, even for small sample size (see also Figure 2). By considering the asymptotic behaviour
of the estimators, we are able to make this statement more precise in Section 4.
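
The text does not say how the Monte Carlo samples were generated. One natural possibility, sketched here as an assumption, is a two-stage draw that follows the mixture representation: first pick the upper endpoint $y$ with its mixing weight, then draw uniformly on $\{0, \ldots, y\}$. The function name and the dictionary format for the weights are hypothetical.

```python
import random


def sample_uniform_mixture(weights, n, rng):
    """Draw n values: pick y with probability weights[y], then X uniform on {0..y}.

    `weights` maps each upper endpoint y to its mixing weight.
    """
    endpoints = list(weights)
    probs = [weights[y] for y in endpoints]
    draws = []
    for _ in range(n):
        y = rng.choices(endpoints, weights=probs, k=1)[0]
        draws.append(rng.randint(0, y))  # randint includes both endpoints
    return draws
```

For distribution (a), calling `sample_uniform_mixture({3: 0.2, 7: 0.8}, 100, random.Random(0))` returns one sample of size 100, from which the three estimators and their distances to the truth can then be computed.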

All three estimators are consistent estimators of the true distribution, regardless of their 
relative performance. 

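Theorem 2.4. Suppose that $p$ is monotone decreasing. Then all three estimators $\hat p_n$, $\hat p_n^R$
and $\hat p_n^G$ are consistent estimators of $p$ in the sense that

$$ \rho(\tilde p_n, p) \to 0 $$

almost surely as $n \to \infty$ for $\tilde p_n = \hat p_n$, $\hat p_n^R$ and $\hat p_n^G$, whenever $\rho(\tilde p, p) = H(\tilde p, p)$ or $\rho(\tilde p, p) =
\|\tilde p - p\|_k$, $1 \le k \le \infty$.

As a corollary, we obtain the following Glivenko-Cantelli type result.

Corollary 2.5. Let $\hat F_n^R(x) = \sum_{y=0}^{x} \hat p_{n,y}^R$ and $\hat F_n^G(x) = \sum_{y=0}^{x} \hat p_{n,y}^G$, with $F(x) = \sum_{y=0}^{x} p_y$.
Then

$$ \sup_x |\hat F_n^R(x) - F(x)| \to 0 \quad \text{and} \quad \sup_x |\hat F_n^G(x) - F(x)| \to 0, $$

almost surely.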

3 Limiting distributions 

Next, we consider the large sample behaviour of $\hat p_n$, $\hat p_n^R$ and $\hat p_n^G$. To do this, define the
fluctuation processes $Y_n$, $Y_n^R$ and $Y_n^G$ as

$$ Y_{n,x} = \sqrt{n}\,(\hat p_{n,x} - p_x), $$
$$ Y_{n,x}^R = \sqrt{n}\,(\hat p_{n,x}^R - p_x), $$
$$ Y_{n,x}^G = \sqrt{n}\,(\hat p_{n,x}^G - p_x). $$

Regardless of the shape of $p$, the limiting distribution of $Y_n$ is well-known. In what follows
we use the notation $Y_{n,x} \to_d Y_x$ to denote weak convergence of random variables in $\mathbb{R}$ (we
also use this notation for $\mathbb{R}^d$), and $Y_n \Rightarrow Y$ to denote that the process $Y_n$ converges weakly
to the process $Y$. Let $Y = \{Y_x\}_{x \in \mathbb{N}}$ be a Gaussian process on the Hilbert space $\ell_2$ with mean
zero and covariance operator $S$ such that $\langle S e_{(x)}, e_{(x')} \rangle = p_x \delta_{x,x'} - p_x p_{x'}$, where $e_{(x)}$ denotes a
sequence which is one at location $x$, and zero everywhere else. The process is well-defined,
since

$$ \mathrm{trace}\, S = E\big[ \|Y\|_2^2 \big] = \sum_{x \ge 0} p_x (1 - p_x) < \infty. $$

For background on Gaussian processes on Hilbert spaces we refer to Parthasarathy (1967).



Theorem 3.1. For any mass function $p$, the process $Y_n$ satisfies $Y_n \Rightarrow Y$ in $\ell_2$.

Remark 3.2. We assume that $Y$ is defined only on the support of the mass function $p$.
That is, let $\kappa = \sup\{x : p_x > 0\}$. If $\kappa < \infty$ then $Y = \{Y_x\}_{x=0}^{\kappa}$.

3.1 Local Behaviour 

At a fixed point $x$ there are only two possibilities for the true mass function $p$: either $x$
belongs to a flat region for $p$ (i.e. $p_r = \ldots = p_x = \ldots = p_s$ for some $r \le x \le s$ with $r < s$), or $p$ is
strictly decreasing at $x$: $p_{x-1} > p_x > p_{x+1}$. In the first case the three estimators exhibit
different limiting behaviour, while in the latter all three have the same limiting distribution.
In some sense, this result is not surprising. Suppose that $x$ is such that $p_{x-1} > p_x > p_{x+1}$.
Then asymptotically this will hold also for $\hat p_n$: $\hat p_{n,x-k} > \hat p_{n,x} > \hat p_{n,x+k}$ for $k \ge 1$ and for
sufficiently large $n$. Therefore, in the rearrangement of $\hat p_n$ the values at $x$ will always stay
the same, i.e. $\hat p_{n,x}^R = \hat p_{n,x}$. Similarly, the empirical distribution function $F_n$ will also be
locally concave at $x$, and therefore both $x$ and $x - 1$ will be touchpoints of $F_n$ with its LCM.
This implies that $\hat p_{n,x}^G = \hat p_{n,x}$.

On the other hand, suppose that $x$ is such that $p_{x-1} = p_x = p_{x+1}$. Then asymptotically
the empirical density will have random order near $x$, and therefore both re-orderings (either
via rearrangement or via the LCM) will be necessary to obtain $\hat p_{n,x}^R$ and $\hat p_{n,x}^G$.

3.1.1 When p is flat at x.

We begin with some notation. Let $q = \{q_x\}_{x \in \mathbb{N}}$ be a sequence, and let $r \le s$ be positive
integers. We define $q^{(r,s)} = \{q_r, q_{r+1}, \ldots, q_{s-1}, q_s\}$ to be the $r$ through $s$ elements of $q$.

Proposition 3.3. Suppose that for some $r, s \in \mathbb{N}$ with $s - r \ge 1$ the probability mass
function $p$ satisfies $p_{r-1} > p_r = \cdots = p_s > p_{s+1}$. Then

$$ (Y_n^R)^{(r,s)} \to_d \mathrm{rear}(Y^{(r,s)}), $$
$$ (Y_n^G)^{(r,s)} \to_d \mathrm{gren}(Y^{(r,s)}). $$

The last statement of the above proposition is the discrete version of the same result in the
continuous case due to Carolan and Dykstra (1999) for a density with locally flat regions.
Thus, both the discrete and continuous settings have similar behaviour in this situation.
Figure 4 shows the exact and limiting cumulative distribution functions when $p = 0.2\,p^{U(3)} +
0.8\,p^{U(7)}$ (same as in Figure 3, top) at locations $x = 4$ and $x = 7$. Note the significantly "more
discrete" behaviour of the empirical and rearrangement estimators in comparison with the
MLE. Also note the lack of accuracy in the approximation at $x = 4$ when $n = 100$ (top
left), which is more prominent for the rearrangement estimator. This occurs because $x = 4$
is a boundary point, in the sense that $p_3 > p_4$, and is therefore least resilient to any global
changes in $\hat p_n$. Lastly, note that the distribution functions satisfy $F_{Y_4} \ge F_{Y_4^G} \ge F_{Y_4^R}$ at
$x = 4$, while at $x = 7$, $F_{Y_7^R} \ge F_{Y_7^G} \ge F_{Y_7}$. It is not difficult to see that the relationships
$Y_4^R \ge Y_4^G \ge Y_4$ and $Y_7^R \le Y_7^G \le Y_7$ must hold from the definition of $(Y^R)^{(4,7)} = \mathrm{rear}(Y^{(4,7)})$
and $(Y^G)^{(4,7)} = \mathrm{gren}(Y^{(4,7)})$.

Figure 4: The limiting distributions at $x = 4$ (left) and at $x = 7$ (right) when $p = 0.2\,p^{U(3)} +
0.8\,p^{U(7)}$: the limiting distributions are shown (dashed) along with the exact distributions
(solid) of $Y_n$, $Y_n^R$, $Y_n^G$ for $n = 100$ (top) and $n = 1000$ (bottom).

Proposition 3.4. Let $\theta = p_r = \cdots = p_s$, and let $\tilde Y^{(r,s)}$ denote a multivariate normal
vector with mean zero and variance matrix $\{\tilde\sigma_{i,j}\}_{i,j=r}^{s}$ where

[...]

for $\tau = s - r + 1$. Let $Z$ be a standard normal random variable independent of $\tilde Y^{(r,s)}$. Then

$$ \mathrm{gren}(Y^{(r,s)}) \stackrel{d}{=} [\ldots]\, Z + [\ldots]\, \mathrm{gren}(\tilde Y^{(r,s)}). $$

Note that the behaviour of $\mathrm{gren}(Y^{(r,s)})$ and $\mathrm{gren}(\tilde Y^{(r,s)})$ will be quite different since
$\sum_{x=r}^{s} \tilde Y_x^{(r,s)} = 0$ almost surely, but the same is not true for $Y^{(r,s)}$.



Remark 3.5. To match the notation of Carolan and Dykstra (1999), note that $\tau\, \mathrm{gren}(\tilde Y^{(r,s)})$
is equivalent to the left slopes at the points $\{1, \ldots, \tau\}/\tau$ of the least concave majorant of the
standard Brownian bridge at the points $\{0, 1, \ldots, \tau\}/\tau$. This random vector most closely
matches the left derivative of the least concave majorant of the Brownian bridge on $[0, 1]$,
which is the process that shows up in the limit for the continuous case.



3.1.2 When p is strictly monotone at x. 

In this situation, the three estimators $\hat p_{n,x}$, $\hat p_{n,x}^R$ and $\hat p_{n,x}^G$ have the same asymptotic behaviour.
This is considerably different from what happens for continuous densities, and occurs because
of the inherent discreteness of the problem for probability mass functions.

Proposition 3.6. Suppose that for some $r, s \in \mathbb{N}$ with $s - r \ge 0$ the probability mass
function $p$ satisfies $p_{r-1} > p_r > \cdots > p_s > p_{s+1}$. Then

$$ (Y_n^R)^{(r,s)} \to_d Y^{(r,s)} \quad \text{and} \quad (Y_n^G)^{(r,s)} \to_d Y^{(r,s)} \quad \text{in } \mathbb{R}^{s-r+1}. $$

Remark 3.7. We note that the convergence results of Propositions 3.3 and 3.6 also hold
jointly. That is, convergence of the three processes $(Y_n^{(r,s)}, (Y_n^R)^{(r,s)}, (Y_n^G)^{(r,s)})$ may also be
proved jointly in $\mathbb{R}^{3(s-r+1)}$.



3.2 Convergence of the Process 

We now strengthen these results to obtain convergence of the processes $Y_n^R$ and $Y_n^G$ in $\ell_2$.
Note that the limit of $Y_n$ has already been stated in Theorem 3.1.

Theorem 3.8. Let $Y$ be the Gaussian process defined in Theorem 3.1 with $p$ a monotone
decreasing distribution. Define $Y^R$ and $Y^G$ as the processes obtained by the following
transforms of $Y$: for all periods of constancy of $p$, i.e. for all $s \ge r$ with $s - r \ge 1$ such that
$p_{r-1} > p_r = \cdots = p_x = \cdots = p_s > p_{s+1}$, let

$$ (Y^R)^{(r,s)} = \mathrm{rear}(Y^{(r,s)}), $$
$$ (Y^G)^{(r,s)} = \mathrm{gren}(Y^{(r,s)}). $$

Then $Y_n^R \Rightarrow Y^R$, and $Y_n^G \Rightarrow Y^G$ in $\ell_2$.






The two extreme cases, $p$ strictly monotone decreasing and $p$ equal to the uniform
distribution, may now be considered as corollaries. By studying the uniform case, we also study
the behaviour of [...] (via Proposition 3.4), and therefore we consider this case in detail.

Corollary 3.9. Suppose that $p$ is strictly monotone decreasing. That is, suppose that
$p_x > p_{x+1}$ for all $x \ge 0$. Then $Y_n^R \Rightarrow Y$ and $Y_n^G \Rightarrow Y$ in $\ell_2$.

3.2.1 The Uniform Distribution

Here, the limiting distribution $Y$ is a vector of length $y + 1$ having a multivariate normal
distribution with $E[Y_x] = 0$ and $\mathrm{cov}(Y_x, Y_z) = (y+1)^{-1} \delta_{x,z} - (y+1)^{-2}$.

Corollary 3.10. Suppose that $p$ is the uniform probability mass function on $\{0, \ldots, y\}$,
where $y \in \mathbb{N}$. Then $Y_n^R \to_d \mathrm{rear}(Y)$ and $Y_n^G \to_d \mathrm{gren}(Y)$.




Figure 5: The relationship between the limiting process $Y$ and the least concave majorant of
its partial sums for the uniform distribution on $\{0, \ldots, 5\}$. Left: the slopes of the lines $L_1$, $L_2$
and $L_3$ give the values $\mathrm{gren}(Y)_0$, $\mathrm{gren}(Y)_1 = \ldots = \mathrm{gren}(Y)_4$ and $\mathrm{gren}(Y)_5$, respectively.
Right: the discrete Brownian bridge lies entirely below zero. Therefore, its LCM is zero, and
also $\mathrm{gren}(Y) \equiv 0$. This event occurs with positive probability (see also Figure 6).



The limiting process $\mathrm{gren}(Y)$ may also be described as follows. Let $\mathbb{U}(\cdot)$ denote the
standard Brownian bridge process on $[0, 1]$, and write $U_k = \sum_{j=0}^{k} Y_j$ for $k = -1, \ldots, y$. Then
we have equality in distribution of

$$ U = \{U_{-1}, U_0, \ldots, U_{y-1}, U_y\} \stackrel{d}{=} \Big\{ \mathbb{U}\Big(\frac{k+1}{y+1}\Big) : k = -1, \ldots, y \Big\}. $$

In particular we have that $U_{-1} = U_y = \sum_{j=0}^{y} Y_j = 0$. Thus, the process $U$ is a discrete
analogue of the Brownian bridge, and $\mathrm{gren}(Y)$ is the vector of (left) derivatives of the least
concave majorant of $\{(j, U_j) : j = -1, \ldots, y\}$. Figure 5 illustrates two different realizations
of the processes $Y$ and $\mathrm{gren}(Y)$.






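Remark 3.11. Note that if the discrete Brownian bridge is itself concave (equivalently, if $Y$
is already decreasing), then the limits $Y$, $\mathrm{rear}(Y)$ and $\mathrm{gren}(Y)$ will be equivalent. This occurs
with probability

$$ P\big( Y = \mathrm{rear}(Y) = \mathrm{gren}(Y) \big) = \frac{1}{(y+1)!}. $$

The result matches that in part (iii) of Theorem 2.1.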




Figure 6: Limiting distribution of the MLE for the uniform case with $y = 9$: marginal
cumulative distribution functions at $x = 0, 4, 9$ (left). The probability that $\mathrm{gren}(Y) \equiv 0$ is
plotted for different values of $y$ (right). For $y = 9$, it is equal to 0.0999.



Figure 6 examines the behaviour of the limiting distribution of the MLE for several
values of $x$. Since this is found via the LCM of the discrete Brownian bridge, it maintains
the monotonicity property in the limit: that is, $\mathrm{gren}(Y)_x \ge \mathrm{gren}(Y)_{x+1}$. This can easily be
seen by examining the marginal distributions of $\mathrm{gren}(Y)$ for different values of $x$ (Figure 6,
left). For each $x$, there is a positive probability that $\mathrm{gren}(Y)_x = 0$. This occurs if the discrete
Brownian bridge lies entirely below zero and then the least concave majorant is identically
zero, in which case $\mathrm{gren}(Y)_x = 0$ for all $x = 0, \ldots, y$ (as in Figure 5, right). The probability
of this event may be calculated exactly using the distribution function of the multivariate
normal. Figure 6 (right) shows several values for different $y$.
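
As a rough cross-check of the probability that the least concave majorant is identically zero, the event can also be simulated rather than computed from the multivariate normal distribution function. The sketch below is a Monte Carlo stand-in, not the exact calculation used in the text. It relies on the fact that centred i.i.d. standard normal variables have covariance proportional to that of $Y$ in the uniform case, and that rescaling does not change the signs of the partial sums. The function name is hypothetical.

```python
import random


def prob_lcm_identically_zero(y, reps, rng):
    """Monte Carlo estimate of P(U_0 <= 0, ..., U_{y-1} <= 0) for the discrete bridge."""
    hits = 0
    for _ in range(reps):
        xi = [rng.gauss(0.0, 1.0) for _ in range(y + 1)]
        mean = sum(xi) / (y + 1)
        partial, below = 0.0, True
        # U_y equals zero by construction, so only the first y partial sums matter.
        for value in xi[:-1]:
            partial += value - mean
            if partial > 0.0:
                below = False
                break
        hits += below
    return hits / reps
```

For $y = 9$ the estimate should land close to the value 0.0999 quoted in the caption of Figure 6, up to Monte Carlo error.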



4 Limiting distributions for the metrics 

In the previous section we obtained asymptotic distribution results for the three estimators.
To compare the estimators, we need to also consider convergence of the Hellinger and $\ell_k$
metrics. Our results show that $\hat p_n^R$ and $\hat p_n$ are asymptotically equivalent (in the sense that
the metrics have the same limit). The MLE is also asymptotically equivalent, but if and
only if $p$ is strictly monotone. If $p$ has any periods of constancy, then the MLE has better
asymptotic behaviour. Heuristically, this happens because, by definition, $Y^G$ is a sequence
of local averages of $Y$, and averages have smaller variability. Furthermore, the more and
larger the periods of constancy, the better the MLE performs; see, in particular, Proposition
4.5 below. These results quantify, for large sample size, the observations of Figure 3.

The rate of convergence of the $\ell_2$ metric is an immediate consequence of Theorem 3.8.
Below, the notation $Z_1 \le_S Z_2$ denotes stochastic ordering: i.e. $P(Z_1 > x) \le P(Z_2 > x)$ for
all $x \in \mathbb{R}$ (the ordering is strict if both inequalities are replaced with strict inequalities).

Corollary 4.1. Suppose that $p$ is a monotone decreasing distribution. Then, for any
$2 \le k \le \infty$,

$$ \sqrt{n}\, \|\hat p_n - p\|_k = \|Y_n\|_k \to_d \|Y\|_k, $$
$$ \sqrt{n}\, \|\hat p_n^R - p\|_k = \|Y_n^R\|_k \to_d \|Y\|_k, $$
$$ \sqrt{n}\, \|\hat p_n^G - p\|_k = \|Y_n^G\|_k \to_d \|Y^G\|_k \le_S \|Y\|_k. $$

If $p$ is not strictly monotone, then $\le_S$ may be replaced with $<_S$. The above convergence also
holds in expectation (that is, [...] and so forth). Furthermore,

$$ E\big[ \|Y^G\|_2^2 \big] \le E\big[ \|Y\|_2^2 \big] = \sum_{x \ge 0} p_x (1 - p_x), $$

with equality if and only if $p$ is strictly monotone.

Convergence of the other two metrics is not as immediate, and depends on the tail 
behaviour of the distribution p. 

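Corollary 4.2. Suppose that $p$ is such that $\sum_{x \ge 0} \sqrt{p_x} < \infty$. Then

$$ \sqrt{n}\, \|\hat p_n - p\|_1 = \|Y_n\|_1 \to_d \|Y\|_1, $$
$$ \sqrt{n}\, \|\hat p_n^R - p\|_1 = \|Y_n^R\|_1 \to_d \|Y\|_1, $$
$$ \sqrt{n}\, \|\hat p_n^G - p\|_1 = \|Y_n^G\|_1 \to_d \|Y^G\|_1 \le_S \|Y\|_1. $$

If $p$ is not strictly monotone, then $\le_S$ may be replaced with $<_S$. The above convergence also
holds in expectation, and

$$ E\big[ \|Y^G\|_1 \big] \le E\big[ \|Y\|_1 \big] = \sqrt{\frac{2}{\pi}} \sum_{x \ge 0} \sqrt{p_x (1 - p_x)}, $$

with equality if and only if $p$ is strictly monotone.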

Convergence of the Hellinger distance requires an even more stringent condition. 
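Corollary 4.3. Suppose that $\kappa = \sup\{x : p_x > 0\} < \infty$. Then

$$ n H^2(\hat p_n, p) \to_d \frac{1}{8} \sum_{x=0}^{\kappa} \frac{Y_x^2}{p_x}, $$
$$ n H^2(\hat p_n^R, p) \to_d \frac{1}{8} \sum_{x=0}^{\kappa} \frac{Y_x^2}{p_x}, $$
$$ n H^2(\hat p_n^G, p) \to_d \frac{1}{8} \sum_{x=0}^{\kappa} \frac{(Y_x^G)^2}{p_x} \le_S \frac{1}{8} \sum_{x=0}^{\kappa} \frac{Y_x^2}{p_x}. $$

If $p$ is not strictly monotone, then $\le_S$ may be replaced with $<_S$. The distribution of
$\sum_{x=0}^{\kappa} Y_x^2 / p_x$ is chi-squared with $\kappa$ degrees of freedom. The above convergence also holds
in expectation, and

$$ E\Big[ \sum_{x=0}^{\kappa} \frac{(Y_x^G)^2}{p_x} \Big] \le E\Big[ \sum_{x=0}^{\kappa} \frac{Y_x^2}{p_x} \Big], $$

with equality if and only if $p$ is strictly monotone.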

Remark 4.4. We note that if $\sum_{x \ge 0} \sqrt{p_x} = \infty$, then $\sum_{x \ge 0} |Y_x| = \infty$ almost surely, and if
$\kappa = \infty$, then $\sum_{x \ge 0} Y_x^2 / p_x$ is also infinite almost surely. This implies that for the empirical
and rearrangement estimators, the conditions in Corollaries 4.2 and 4.3 are also necessary
for convergence. The same is true for the Grenander estimator, when the true distribution
is strictly decreasing.

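Proposition 4.5. Let $p$ be a decreasing distribution, and write it in terms of its intervals
of constancy. That is, let

$$ p_x = \theta_i \quad \text{if } x \in C_i, $$

where $\theta_i > \theta_{i+1}$ for all $i = 1, 2, \ldots$, and where $\{C_i\}_{i \ge 1}$ forms a partition of $\mathbb{N}$. Then

$$ E\Big[ \sum_{x \ge 0} (Y_x^G)^2 \Big] = \sum_{i \ge 1} \sum_{j=1}^{|C_i|} \theta_i \Big( \frac{1}{j} - \theta_i \Big). $$

Also, if $\kappa = \sup\{x : p_x > 0\} < \infty$, then

$$ E\Big[ \sum_{x=0}^{\kappa} \frac{(Y_x^G)^2}{p_x} \Big] = \sum_{i \ge 1} \sum_{j=1}^{|C_i|} \Big( \frac{1}{j} - \theta_i \Big). $$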



This result allows us to explicitly calculate exactly how much "better" the performance
of the MLE is, in comparison to $Y$ and $Y^R$. With $\mathbb{R}$-valued random variables, it is standard
to compare the asymptotic variance to evaluate the relative efficiency of two estimators. We,
on the other hand, are dealing with $\mathbb{R}^{\mathbb{N}}$-valued processes. Consider some process $W \in \mathbb{R}^{\mathbb{N}}$,
and let $\Sigma_W$ denote its covariance matrix (of size $\mathbb{N} \times \mathbb{N}$). Then the trace norm of $\Sigma_W$ is equal
to the expected squared $\ell_2$ norm of $W$,

$$ E\big[ \|W\|_2^2 \big] = \|\Sigma_W\|_{\mathrm{trace}} = \sum_{i \ge 1} \lambda_i, $$

where $\{\lambda_i\}_{i \ge 1}$ denotes the eigenvalues of $\Sigma_W$. Therefore, Corollary 4.1 tells us that,
asymptotically, $Y^G$ is more efficient than $Y^R$ and $Y$, in the sense that

$$ \|\Sigma_{Y^G}\|_{\mathrm{trace}} \le \|\Sigma_{Y^R}\|_{\mathrm{trace}} = \|\Sigma_Y\|_{\mathrm{trace}}, $$

with equality if and only if $p$ is strictly decreasing. Furthermore, Proposition 4.5 allows us
to calculate exactly how much more efficient $Y^G$ is for any given mass function $p$.

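Suppose that $p$ has exactly one period of constancy on $r \le x \le s$, and let $\tau = s - r + 1 \ge 2$.
Further, suppose that $p_x = \theta^*$ for $r \le x \le s$. Then

$$ E\big[ \|Y^R\|_2^2 \big] - E\big[ \|Y^G\|_2^2 \big] = E\big[ \|Y\|_2^2 \big] - E\big[ \|Y^G\|_2^2 \big] = \theta^* \sum_{i=1}^{\tau} \Big( 1 - \frac{1}{i} \Big). $$

In particular, if $p$ is the uniform distribution on $\{0, \ldots, y\}$, then we find that $E[\|Y^R\|_2^2] =
y/(y+1)$, whereas $E[\|Y^G\|_2^2]$ behaves like $\log y/(y+1)$, and is much smaller.
Note that if $p$ is strictly monotone, then we obtain

$$ E\Big[ \sum_{x \ge 0} (Y_x^G)^2 \Big] = \sum_{i \ge 1} \theta_i (1 - \theta_i) = E\Big[ \sum_{x \ge 0} Y_x^2 \Big], $$

as required. Also, if $p$ is the uniform probability mass function on $\{0, \ldots, y\}$, we conclude
that

$$ E\Big[ \sum_{x=0}^{y} \frac{\mathrm{gren}(Y)_x^2}{p_x} \Big] = \sum_{i=1}^{y} \frac{1}{i+1}, $$

where $\log y - 0.5 < \sum_{i=1}^{y} (i+1)^{-1} < \log(y+1)$.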
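
The two limiting $\ell_2$ risks can be compared numerically. The sketch below assumes the double-sum form $\sum_i \sum_{j=1}^{|C_i|} \theta_i (1/j - \theta_i)$ for the MLE, which is how Proposition 4.5 is read here and which agrees with the uniform and strictly monotone special cases stated in the text; that reading is an assumption, since the displays in this copy are damaged. The function names and the `(theta, block_size)` input format are illustrative.

```python
from fractions import Fraction


def mle_limit_l2_risk(levels):
    """sum_i sum_{j=1}^{|C_i|} theta_i * (1/j - theta_i).

    `levels` is a list of (theta_i, block_size) pairs describing the
    intervals of constancy of p on its support.
    """
    return sum(theta * (Fraction(1, j) - theta)
               for theta, size in levels
               for j in range(1, size + 1))


def empirical_limit_l2_risk(levels):
    """sum_x p_x * (1 - p_x), accumulated block by block."""
    return sum(size * theta * (1 - theta) for theta, size in levels)
```

For the uniform distribution on $\{0, \ldots, 5\}$, i.e. `levels = [(Fraction(1, 6), 6)]`, the empirical limit is $5/6$ while the MLE limit is $29/120$, about 0.24. For a strictly monotone $p$ every block has size one and the two functions agree.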

Lastly, consider a distribution with bounded support, and fix $r < s$ where $p$ is strictly
monotone on $\{r, \ldots, s\}$. That is, we have that $p_{r-1} > p_r > \cdots > p_s > p_{s+1}$. Next define $\tilde p$
by $\tilde p_x = p_x$ for $x < r$ and $x > s$, and $\tilde p_x = \sum_{x'=r}^{s} p_{x'} / (s - r + 1)$ for $x \in \{r, \ldots, s\}$. Then the
difference in the expected Hellinger metrics under the two distributions is

$$ E_p\Big[ \sum_{x=0}^{\kappa} \frac{(Y_x^G)^2}{p_x} \Big] - E_{\tilde p}\Big[ \sum_{x=0}^{\kappa} \frac{(Y_x^G)^2}{\tilde p_x} \Big] = \sum_{j=1}^{\tau} \Big( 1 - \frac{1}{j} \Big), $$

where $\tau = s - r + 1$. Therefore, the longer the intervals of constancy in a distribution, the
better the performance of the MLE.
Remark 4.6. From Theorem 1.6.2 of Robertson et al. (1988) it follows that for any $x \ge 0$

[...]

This result may also be proved using the method used to show Proposition 4.5. Note that
this pointwise inequality does not hold in general for $Y^G$ replaced with $Y^R$.

Corollaries 4.1 and 4.2 then translate into statements concerning the limiting risks of the
three estimators $\hat p_n$, $\hat p_n^R$, and $\hat p_n^G$ as follows, where the risk was defined in (2.5). In particular,
we see that, asymptotically, both $\hat p_n^R$ and $\hat p_n$ are inadmissible, and are dominated by the
maximum likelihood estimator $\hat p_n^G$.

Corollary 4.7. For any $2 \le k < \infty$, and any $p \in \mathcal{P}$, the class of decreasing probability
mass functions on $\mathbb{N}$,

$$ n^{k/2} R_k(p, \hat p_n) \to E\big[ \|Y\|_k^k \big], $$
$$ n^{k/2} R_k(p, \hat p_n^R) \to E\big[ \|Y\|_k^k \big], $$
$$ n^{k/2} R_k(p, \hat p_n^G) \to E\big[ \|Y^G\|_k^k \big] \le E\big[ \|Y\|_k^k \big]. $$

The inequality in the last line is strict if $p$ is not strictly monotone. The statements also
hold for $k = 1$ under the additional hypothesis that $\sum_{x \ge 0} \sqrt{p_x} < \infty$.



5 Estimating the mixing distribution 

Here, we consider the problem of estimating the mixing distribution $q$ in (1.1). This may
be done directly via the estimators of $p$ and the formula (1.2). Define the estimators of the
mixing distribution as follows

$$ \hat q_{n,x} = -(x+1)\, \Delta \hat p_{n,x}, $$
$$ \hat q_{n,x}^R = -(x+1)\, \Delta \hat p_{n,x}^R, $$
$$ \hat q_{n,x}^G = -(x+1)\, \Delta \hat p_{n,x}^G. $$

Each of these estimators sums to one by definition; however, $\hat q_n$ is not guaranteed to be
positive. The main results of this section are consistency and $\sqrt{n}$-rate of convergence of
these estimators.

Theorem 5.1. Suppose that $p$ is monotone decreasing and satisfies $\sum_{x \ge 0} x\, p_x < \infty$. Then
all three estimators $\hat q_n$, $\hat q_n^R$ and $\hat q_n^G$ are consistent estimators of $q$ in the sense that

$$ \rho(\tilde q_n, q) \to 0 $$

almost surely as $n \to \infty$ for $\tilde q_n = \hat q_n$, $\hat q_n^R$ and $\hat q_n^G$, whenever $\rho(\tilde q, q) = H(\tilde q, q)$ or $\rho(\tilde q, q) =
\|\tilde q - q\|_k$, $1 \le k \le \infty$.
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To study the rates of convergence we define the the fiuctuation processes Z^, Z^, and Z^ 

as 

Zn,x = \/n{qn,x ~ Qx), 
Zn,x = Vn{q^^^-qx), 

Zn,x = \^iQ^,x-Qx), 

with hmiting processes defined as 

Theorem 5.2. Suppose that $p$ is such that $\kappa = \sup\{x \ge 0 : p_x > 0\} < \infty$. Then $Z_n \Rightarrow Z$, $Z_n^R \Rightarrow Z^R$ and $Z_n^G \Rightarrow Z^G$. Furthermore, for any $k \ge 1$, $\|Z_n\|_k \to_d \|Z\|_k$, $\|Z_n^R\|_k \to_d \|Z^R\|_k$ and $\|Z_n^G\|_k \to_d \|Z^G\|_k$, and these convergences also hold in expectation. Also, $nH^2(\hat q_n, q) \to_d \frac{1}{8}\sum_{x=0}^{\kappa} Z_x^2/q_x$, $nH^2(\hat q_n^R, q) \to_d \frac{1}{8}\sum_{x=0}^{\kappa} (Z_x^R)^2/q_x$ and $nH^2(\hat q_n^G, q) \to_d \frac{1}{8}\sum_{x=0}^{\kappa} (Z_x^G)^2/q_x$, and these again also hold in expectation.

As before, we have asymptotic equivalence of all three estimators if $p$ is strictly decreasing (cf. Corollary 3.9). To determine the relative behaviour of the estimators $\hat q_n^R$ and $\hat q_n^G$ we turn to simulations. Since $\hat q_n$ is not guaranteed to be a probability mass function (unlike the other two estimators), we exclude it from further consideration.

In Figure 3, we show boxplots of $m = 1000$ samples of the distances $\ell_1(\tilde q, q)$, $\ell_2(\tilde q, q)$ and $H(\tilde q, q)$ for $\tilde q = \hat q_n^R$ (light grey) and $\tilde q = \hat q_n^G$ (dark grey) with $n = 20$ (left), $n = 100$ (centre) and $n = 1000$ (right). From top to bottom the true distributions are

(a) p = p^(^), 

(b) $p = 0.2\,p^{U(3)} + 0.8\,p^{U(7)}$,

(c) p = 0.25p^(i) + 0.2p^(3) ^ o,i^pU{5) ^ o.4p^(^), and 

(d) p is geometric with 6 = 0.75. 

We can see that $\hat q_n^G$ has better performance in all metrics, except for the case of the strictly decreasing distribution. As before, the flatter the true distribution is, the better the relative performance of $\hat q_n^G$. Notice that by Corollary 3.9 and Theorem 5.2 the asymptotic behaviour (i.e. rate of convergence and limiting distributions) of the $\ell_2$ norm of $\hat q_n^G$ and $\hat q_n^R$ should be the same if $p$ is strictly decreasing.

Remark 5.3. For $\kappa = \infty$, the process $\{xY_{n,x} : x \in \mathbb{N}\}$ is known to converge weakly in $\ell_2$ if and only if $\sum_{x \ge 0} x^2 p_x < \infty$, while the convergence is known to hold in $\ell_1$ if and only if $\sum_{x \ge 0} x\sqrt{p_x} < \infty$; see e.g. Araujo and Giné (1980, Exercise 3.8.14, page 205). We therefore conjecture that $Z_n^R$ and $Z_n^G$ converge weakly to $Z^R$ and $Z^G$ in $\ell_2$ (resp. $\ell_1$) if and only if $\sum_{x \ge 0} x^2 p_x < \infty$ (resp. $\sum_{x \ge 0} x\sqrt{p_x} < \infty$).

6 Proofs

Proof of Remark 1.1. This bound follows directly from the definition of $p$, since

□ 

In the next lemma, we prove several useful properties of both the rearrangement and Grenander operators.

Lemma 6.1. Consider two sequences $p$ and $q$ with support $S$, and let $\varphi(\cdot)$ denote either the Grenander or rearrangement operator. That is, $\varphi(p) = \mathrm{gren}(p)$ or $\varphi(p) = \mathrm{rear}(p)$.

1. For any increasing function $f : S \to \mathbb{R}$,

\[
\sum_{x \in S} f_x\, \varphi(p)_x \le \sum_{x \in S} f_x\, p_x. \qquad (6.6)
\]

2. Suppose that $\Psi : \mathbb{R} \to \mathbb{R}_+$ is a non-negative convex function such that $\Psi(0) = 0$, and that $q$ is decreasing. Then,

\[
\sum_{x \in S} \Psi(\varphi(p)_x - q_x) \le \sum_{x \in S} \Psi(p_x - q_x). \qquad (6.7)
\]

3. Suppose that $|S|$ is finite. Then $\varphi(p)$ is a continuous function of $p$.
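The two inequalities of the lemma can be checked numerically on a single random example. The sketch below is only an illustration under stated assumptions: it takes $f_x = x^2$ and $\Psi(t) = |t|^3$ as arbitrary choices, draws hypothetical sequences from a Dirichlet distribution, and implements gren by pooling adjacent violators. It is not a proof and not the authors' code.

```python
import numpy as np

def rearrangement(p):
    """Decreasing rearrangement of the sequence p."""
    return np.sort(np.asarray(p, dtype=float))[::-1]

def grenander(p):
    """Left slopes of the least concave majorant of the cumulative sums of p."""
    blocks = []  # [block sum, block size]
    for v in np.asarray(p, dtype=float):
        blocks.append([v, 1])
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(10))                  # arbitrary, typically non-monotone
q = np.sort(rng.dirichlet(np.ones(10)))[::-1]   # decreasing q, as required in part 2
f = np.arange(10.0) ** 2                        # an increasing function on S = {0,...,9}

results = {}
for name, phi in (("gren", grenander), ("rear", rearrangement)):
    lhs_66, rhs_66 = f @ phi(p), f @ p                  # two sides of (6.6)
    lhs_67 = np.sum(np.abs(phi(p) - q) ** 3)            # (6.7) with Psi(t) = |t|^3
    rhs_67 = np.sum(np.abs(p - q) ** 3)
    results[name] = (lhs_66, rhs_66, lhs_67, rhs_67)
```

For both operators, the first entry of each tuple should not exceed the second, and the third should not exceed the fourth, up to floating-point rounding.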

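Proof. 1. Suppose that $S = \{s_1, \ldots, s_2\}$, where it is possible that $s_2 = \infty$. Then it is clear from the properties of the rearrangement and Grenander operators that

\[
\sum_{x=s_1}^{s_2} \varphi(p)_x = \sum_{x=s_1}^{s_2} p_x \quad \text{and} \quad \sum_{x=s_1}^{y} \varphi(p)_x \ge \sum_{x=s_1}^{y} p_x
\]

for $y \in S$. These inequalities immediately imply (6.6), since, by summation by parts,

\[
\begin{aligned}
\sum_{x=s_1}^{s_2} f_x p_x &= \sum_{x=s_1}^{s_2} \sum_{y=s_1}^{x-1} (f_{y+1} - f_y)\, p_x + \sum_{x=s_1}^{s_2} f_{s_1} p_x \\
&= \sum_{y=s_1}^{s_2} (f_{y+1} - f_y) \sum_{x=y+1}^{s_2} p_x + \sum_{x=s_1}^{s_2} f_{s_1} p_x,
\end{aligned}
\]

and $f$ is an increasing function.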



2. For the Grenander estimator this is simply Theorem 1.6.1 in Robertson et al. (1988). For the rearrangement estimator, we adapt the proof from Theorem 3.5 in Lieb and Loss (1997). We first write $\Psi = \Psi_+ + \Psi_-$, where $\Psi_+(x) = \Psi(x)$ for $x \ge 0$ and $\Psi_-(x) = \Psi(x)$ for $x \le 0$. Now, since $\Psi_+$ is convex, there exists an increasing function $\Psi'_+$ such that $\Psi_+(x) = \int_0^x \Psi'_+(t)\,dt$. Now,

\[
\Psi_+(p_x - q_x) = \int_{q_x}^{\infty} \Psi'_+(p_x - s)\,ds = \int_0^{\infty} \Psi'_+(p_x - s)\, I[q_x < s]\,ds.
\]

Applying Fubini's theorem, we have that

\[
\sum_{x \in S} \Psi_+(p_x - q_x) = \int_0^{\infty} \Big\{ \sum_{x \in S} \Psi'_+(p_x - s)\, I[q_x < s] \Big\}\,ds.
\]

Now, the function $I[q_x < s]$ is an increasing function of $x$, and for $\varphi(p) = \mathrm{rear}(p)$, for each fixed $s$ we have that $\varphi(\Psi'_+(p - s))_x = \Psi'_+(\varphi(p)_x - s)$, since $\Psi'_+$ is an increasing function. Therefore, applying (6.6), we find that the last display above is bounded below by

\[
\int_0^{\infty} \Big\{ \sum_{x \in S} \Psi'_+(\varphi(p)_x - s)\, I[q_x < s] \Big\}\,ds = \sum_{x \in S} \Psi_+(\varphi(p)_x - q_x).
\]

The proof for $\Psi_-$ is the same, except that here we use the identity

\[
\Psi_-(p_x - q_x) = \int_0^{\infty} \Psi'_-(p_x - s)\,\{-I[q_x > s]\}\,ds.
\]

3. Since $|S|$ is finite, we know that $p$ is a finite vector, and therefore it is enough to prove continuity at any point $x \in S$. For $\varphi = \mathrm{rear}$ this is a well-known fact. Next, note that if $p_n \to p$, then the partial sums of $p_n$ also converge to the partial sums of $p$. From Lemma 2.2 of Durot and Tocquet (2003), it follows that the least concave majorant of $p_n$ converges to the least concave majorant of $p$, and hence, so do their differences. Thus $\mathrm{gren}(p_n)_x \to \mathrm{gren}(p)_x$. □



6.1 Some inequalities and consistency results: proofs 

Proof of Theorem 2.1. (i). Choosing $\Psi(t) = |t|^k$ in (6.7) of Lemma 6.1 proves (2.4). To prove (2.3) recall that

\[
H^2(\tilde p, p) = 1 - \sum_{x \ge 0} \sqrt{\tilde p_x\, p_x}.
\]

By Hardy et al. (1952), Theorem 368, page 261 (or Theorem 3.4 in Lieb and Loss (1997)), it follows that

\[
\sum_{x \ge 0} \sqrt{\hat p_{n,x}\, p_x} \le \sum_{x \ge 0} \sqrt{\hat p^R_{n,x}\, p_x},
\]

which proves the result for the rearrangement estimator. It remains to prove the same for the MLE. Let $\{B_i\}_{i \ge 1}$ denote a partition of $\mathbb{N}$. By definition,

\[
\hat p^G_{n,x} = \frac{1}{|B_i|} \sum_{y \in B_i} \hat p_{n,y}, \qquad x \in B_i,
\]

for some partition. Jensen's inequality now implies that

\[
\sum_{x \ge 0} \sqrt{\hat p_{n,x}\, p_x} \le \sum_{x \ge 0} \sqrt{\hat p^G_{n,x}\, p_x},
\]

which completes the proof.

(ii). is obvious.

(iii). The second statement is obvious in light of (2.4) with $k = \infty$. To see that the probability of monotonicity of the $\hat p_{n,x}$'s converges to $1/(y+1)!$ under the uniform distribution, note that the event in question is the same as the event that the components of the vector $\{\sqrt{n}(\hat p_{n,x} - (y+1)^{-1}) : x \in \{0, \ldots, y\}\}$ are ordered in the same way. This vector converges in distribution to $Z \sim N_{y+1}(0, \Sigma)$, where $\Sigma = \mathrm{diag}(1/(y+1)) - (y+1)^{-2}\,\mathbf{1}\mathbf{1}^T$, and the probability $P(Z_1 \ge Z_2 \ge \cdots \ge Z_{y+1}) = 1/(y+1)!$ since the components of $Z$ are exchangeable.



□

Proof of Corollary 2.3. For any $p \in \mathcal{P}$, we have that

\[
nR_2(p, \hat p_n^R) \le nR_2(p, \hat p_n) = 1 - \sum_{x \ge 0} p_x^2.
\]

Plugging in the discrete uniform distribution on $\{0, \ldots, \kappa\}$, and applying part (ii) of Theorem 2.1, we find that

\[
nR_2(p, \hat p_n^R) = nR_2(p, \hat p_n) = 1 - (\kappa + 1)^{-1}.
\]

Thus, for any $\epsilon > 0$, there exists a $p \in \mathcal{P}$ such that

\[
nR_2(p, \hat p_n^R) = nR_2(p, \hat p_n) > 1 - \epsilon.
\]

Since the upper bound on both risks is one, the result follows. □

Proof of Theorem 2.4. The results of this theorem are quite standard, and we provide a proof only for completeness. Let $F_n$ denote the empirical distribution function and $F$ the cumulative distribution function of the true distribution $p$. For any $K$ (large), we have that for any $x > K$,

\[
|\hat p_{n,x} - p_x| \le \hat p_{n,x} + p_x \le |F_n(K) - F(K)| + 2(1 - F(K)).
\]

Fix $\epsilon > 0$, and choose $K$ large enough so that $(1 - F(K)) < \epsilon/6$. Next, there exists an $n_0$ sufficiently large so that $\sup_{0 \le x \le K} |\hat p_{n,x} - p_x| < \epsilon/3$ and $|F_n(K) - F(K)| < \epsilon/3$ for all $n \ge n_0$ almost surely. Therefore, for $n \ge n_0$

\[
\begin{aligned}
\sup_{x \ge 0} |\hat p_{n,x} - p_x| &\le \sup_{0 \le x \le K} |\hat p_{n,x} - p_x| + |F_n(K) - F(K)| + 2(1 - F(K)) \\
&< \epsilon.
\end{aligned}
\]

This shows that $\|\hat p_n - p\|_k \to 0$ almost surely for $k = \infty$. A similar approach proves the result for any $1 \le k < \infty$. Convergence of $H(\hat p_n, p)$ follows since for mass functions $H(p, q) \le \sqrt{\|p - q\|_1}$ (see e.g. Le Cam (1969), page 35). Consistency of the other estimators, $\hat p_n^R$ and $\hat p_n^G$, now follows from the inequalities of Theorem 2.1. □

Proof of Corollary 2.5. Note that by virtue of the estimators, we have that $\hat F_n^G(x) \ge F_n(x)$ and $\hat F_n^R(x) \ge F_n(x)$ for all $x \ge 0$. Now, fix $\epsilon > 0$. Then there exists a $K$ such that $\sum_{x > K} p_x < \epsilon/4$. By the Glivenko-Cantelli lemma, there exists an $n_0$ such that for all $n \ge n_0$

\[
\sup_{x \ge 0} |F_n(x) - F(x)| < \epsilon/4,
\]

almost surely. Furthermore, by Theorem 2.4, $n_0$ can be chosen large enough so that for all $n \ge n_0$

\[
\sup_{x \ge 0} |\hat p^G_{n,x} - p_x| < \frac{\epsilon}{4(K+1)},
\]

almost surely. Therefore, for all $n \ge n_0$, we have that

\[
\begin{aligned}
\sup_{x \ge 0} |\hat F_n^G(x) - F(x)| &\le \sum_{x=0}^{K} |\hat p^G_{n,x} - p_x| + \sum_{x > K} \hat p^G_{n,x} + \sum_{x > K} p_x \\
&\le \epsilon/4 + \sum_{x > K} \hat p_{n,x} + \epsilon/4 \\
&\le \epsilon/4 + \sum_{x > K} p_x + \epsilon/4 + \epsilon/4 < \epsilon.
\end{aligned}
\]

The proof for the rearrangement estimator is identical. □

6.2 Limiting distributions: proofs

Lemma 6.2. Let $W_n$ be a sequence of processes in $\ell_k$ with $1 \le k < \infty$. Suppose that

1. $\sup_n E[\|W_n\|_k^k] < \infty$,

2. $\lim_{m \to \infty} \sup_n \sum_{x \ge m} E[|W_{n,x}|^k] = 0$.

Then $W_n$ is tight in $\ell_k$.

Proof. Note that for $k < \infty$, compact sets $K$ are subsets of $\ell_k$ such that there exists a sequence of real numbers $A_x$ for $x \in \mathbb{N}$ and a sequence $\lambda_m \to 0$ such that

1. $|w_x| \le A_x$ for all $x \in \mathbb{N}$,

2. $\sum_{x \ge m} |w_x|^k \le \lambda_m$ for all $m$,

for all elements $w \in K$. Clearly, if the conditions of the lemma are satisfied, then for each $\epsilon > 0$, we may choose $A_x$ and $\lambda_m$ so that

\[
P\Big( |W_{n,x}| \le A_x \text{ for all } x \ge 0, \text{ and } \sum_{x \ge m} |W_{n,x}|^k \le \lambda_m \text{ for all } m \Big) \ge 1 - \epsilon
\]

for all $n$. Thus, $W_n$ is tight in $\ell_k$. □

Proof of Theorem 3.1. Convergence of the finite dimensional distributions is standard. It remains to prove tightness in $\ell_2$. By Lemma 6.2 this is straightforward, since

\[
E[\|Y_n\|_2^2] = \sum_{x \ge 0} p_x(1 - p_x) \quad \text{and} \quad E\Big[\sum_{x \ge m} Y_{n,x}^2\Big] = \sum_{x \ge m} p_x(1 - p_x).
\]

□

Throughout the remainder of this section we make extensive use of a set equality for the least concave majorant known as the "switching relation". Let

\[
\begin{aligned}
s_n(a) &= \inf\Big\{k \ge -1 : F_n(k) - a(k+1) = \sup_{y \ge -1}\{F_n(y) - a(y+1)\}\Big\} \\
&= \mathrm{argmax}_{k \ge -1}\{F_n(k) - a(k+1)\} \qquad (6.8)
\end{aligned}
\]

denote the first time that the process $F_n(y) - a(y+1)$ reaches its maximum. Then the following holds:

\[
\{s_n(a) < x\} = \{s_n(a) \le x - 1/2\} = \{\hat p^G_{n,x} \le a\}. \qquad (6.9)
\]

For more background (as well as a proof) of this fact see, for example, Balabdaoui et al. (2009).
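The switching relation (6.9) can be illustrated numerically. The sketch below is an assumption-laden check rather than a proof: it uses a hypothetical random mass function in place of the empirical one, computes the Grenander estimator by pooling adjacent violators, and compares the two events in (6.9) for randomly drawn levels $a$. The helper names are hypothetical.

```python
import numpy as np

def grenander(p):
    """Left slopes of the least concave majorant of the cumulative sums of p."""
    blocks = []  # [block sum, block size]
    for v in np.asarray(p, dtype=float):
        blocks.append([v, 1])
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def first_argmax(F, a):
    """s_n(a): first maximiser over k >= -1 of F_n(k) - a(k + 1).
    F[j] holds F_n(j - 1), so that F[0] = F_n(-1) = 0."""
    k = np.arange(-1, len(F) - 1)
    return int(k[np.argmax(F - a * (k + 1))])  # np.argmax returns the first maximiser

rng = np.random.default_rng(0)
p_hat = rng.dirichlet(np.ones(8))                  # stand-in for an empirical pmf on {0,...,7}
F = np.concatenate([[0.0], np.cumsum(p_hat)])      # F_n(-1), F_n(0), ..., F_n(7)
g = grenander(p_hat)

# Compare {s_n(a) < x} with {p^G_{n,x} <= a} for random levels a and every x.
checks = [(first_argmax(F, a) < x) == (g[x] <= a)
          for a in rng.uniform(0.0, 0.4, size=50)
          for x in range(len(p_hat))]
```

Every entry of `checks` should be `True`; exact ties between a slope and the level $a$ have probability zero for the continuous draws used here.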



Proof of Proposition Let F denote the cumulative distribution function for the function 
$p$. For fixed $t \in \mathbb{R}$ it follows from (6.9) that

\[
\begin{aligned}
P(Y^G_{n,x} \le t) &= P(s_n(p_x + n^{-1/2}t) \le x - 1/2) \\
&= P(\mathrm{argmax}_{y \ge -1}\{Z_n(y)\} \le x - 1/2), \qquad (6.10)
\end{aligned}
\]

where $Z_n(y) = n^{1/2}F_n(y) - (n^{1/2}p_x + t)(y+1)$. Note that for any constant $c$, $\mathrm{argmax}_y(Z_n(y)) = \mathrm{argmax}_y(Z_n(y) + c)$, and therefore we instead take

\[
\begin{aligned}
Z_n(y) &= n^{1/2}(F_n(y) - F_n(r-1)) - (n^{1/2}p_x + t)(y - r + 1) \\
&= V_n(y) + W_n(y) - t(y - r + 1),
\end{aligned}
\]

where

\[
\begin{aligned}
V_n(y) &= \sqrt{n}\big((F_n(y) - F_n(r-1)) - (F(y) - F(r-1))\big), \\
n^{-1/2}W_n(y) &= (F(y) - F(r-1)) - p_x(y - r + 1)
\begin{cases} = 0 & \text{for } r - 1 \le y \le s, \\ < 0 & \text{otherwise.} \end{cases}
\end{aligned}
\]

Let $\mathbb{U}$ denote the standard Brownian bridge on $[0, 1]$. It is well-known that $V_n(y) \Rightarrow \mathbb{U}(F(y)) - \mathbb{U}(F(r-1))$. Also, $W_n(y) \to -\infty$ for $y \notin \{r-1, \ldots, s\}$, and it is identically zero otherwise. It follows that the limit of (6.10) is

\[
\begin{aligned}
& P\big(\mathrm{argmax}_{r-1 \le y \le s}\{\mathbb{U}(F(y)) - \mathbb{U}(F(r-1)) - t(y - r + 1)\} \le x - 1/2\big) \\
&\quad = P\big(\mathrm{argmax}_{r-1 \le y \le s}\{\mathbb{U}(F(y)) - \mathbb{U}(F(r-1)) - t(y - r + 1)\} < x\big)
\end{aligned}
\]

for any $x \in \{r, \ldots, s\}$. Note that the process

\[
\{\mathbb{U}(F(x)) - \mathbb{U}(F(r-1)),\ x = r-1, \ldots, s\} =_d \Big\{\sum_{j=r}^{x} Y_j,\ x = r-1, \ldots, s\Big\},
\]

and therefore the probability above is equal to

\[
P\big(\mathrm{gren}(Y^{(r,s)})_x \le t\big)
\]

for $x \in \{r, \ldots, s\}$. Since the half-open intervals $[a, b)$ are convergence determining, this proves pointwise convergence of $Y^G_{n,x}$ to $\mathrm{gren}(Y^{(r,s)})_x$.

To show convergence of the rearrangement estimator fluctuation process, note that for sufficiently large $n$ we have that $\hat p_{n,r-k} > \hat p_{n,x} > \hat p_{n,s+k}$ for all $x \in \{r, \ldots, s\}$ and $k \ge 1$. Therefore, $(\hat p_n^R)^{(r,s)} = \mathrm{rear}((\hat p_n)^{(r,s)})$ and furthermore, since $p_x$ is constant here, $(Y_n^R)^{(r,s)} = \mathrm{rear}(Y_n^{(r,s)})$. The result now follows from the continuous mapping theorem. □



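Proof of Proposition 3.4. To simplify notation, let $W_m = \mathbb{U}(F(r - 1 + m)) - \mathbb{U}(F(r-1))$ for $m = 0, \ldots, s - r + 1$. Also, let $\theta = p_r = \cdots = p_s$ and then $G_m = F(r - 1 + m) - F(r-1) = \theta m$. Write

\[
W_m = \Big(W_m - \frac{G_m}{G_{\bar s}} W_{\bar s}\Big) + \frac{G_m}{G_{\bar s}} W_{\bar s},
\]

where $\bar s = s - r + 1$. Let $\tilde W_m = W_m - mW_{\bar s}/\bar s$. Then $\tilde W_0 = \tilde W_{\bar s} = 0$ and some calculation shows that $E[\tilde W_m] = 0$ and

\[
\mathrm{cov}(\tilde W_m, \tilde W_{m'}) = \theta \bar s \Big\{ \min\Big(\frac{m}{\bar s}, \frac{m'}{\bar s}\Big) - \frac{m}{\bar s}\,\frac{m'}{\bar s} \Big\}.
\]

Also, $\mathrm{cov}(\tilde W_m, W_{\bar s}) = 0$. Let $Z$ be a standard normal random variable independent of the standard Brownian bridge $\mathbb{U}$. We have shown that

\[
W_m =_d \frac{m}{\bar s}\sqrt{\theta \bar s(1 - \theta \bar s)}\; Z + \sqrt{\theta \bar s}\; \mathbb{U}\Big(\frac{m}{\bar s}\Big).
\]

Next, let $Y_m = \mathbb{U}(m/\bar s) - \mathbb{U}((m-1)/\bar s)$ for $m = 1, \ldots, \bar s$. The vector $Y = (Y_1, \ldots, Y_{\bar s})$ is multivariate normal with mean zero and $\mathrm{cov}(Y_m, Y_{m'}) = \delta_{m,m'}/\bar s - 1/\bar s^2$. To finish the proof, note that $\mathrm{gren}(c + Y) = c + \mathrm{gren}(Y)$ for any constant $c$. □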

Proof of Proposition \3.(A The claim for the rearrangement estimator follows directly from 
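Theorem 2.4 for $k = \infty$. To prove the second claim, we will show that $Y^G_{n,x} - Y_{n,x} = \sqrt{n}(\hat p^G_{n,x} - \hat p_{n,x}) \to_p 0$. To do this, we again use the switching relation (6.9). Fix $\epsilon > 0$. Then

\[
\begin{aligned}
P(Y^G_{n,x} - Y_{n,x} > \epsilon) &= P(\hat p^G_{n,x} > \hat p_{n,x} + n^{-1/2}\epsilon) \\
&= P(s_n(\hat p_{n,x} + n^{-1/2}\epsilon) > x - 1/2) \\
&= P(\mathrm{argmax}_{y \ge -1} Z_n(y - x) > x - 1/2) \\
&= P(\mathrm{argmax}_{h \ge -x-1} Z_n(h) > -1/2), \qquad (6.11)
\end{aligned}
\]

where $Z_n(h) = n^{1/2}F_n(x+h) - (n^{1/2}\hat p_{n,x} + \epsilon)(x + h + 1)$. Since for any constant $c$, $\mathrm{argmax}_y(Z_n(y)) = \mathrm{argmax}_y(Z_n(y) + c)$, we instead take

\[
\begin{aligned}
Z_n(h) &= n^{1/2}(F_n(x+h) - F_n(x-1)) - (n^{1/2}\hat p_{n,x} + \epsilon)(h+1) \\
&= U_n(h) - V_n(h) + W_n(h) - \epsilon(h+1),
\end{aligned}
\]

where

\[
\begin{aligned}
U_n(h) &= \sqrt{n}\big((F_n(x+h) - F_n(x-1)) - (F(x+h) - F(x-1))\big), \\
(h+1)^{-1}V_n(h) &= \sqrt{n}\big((F_n(x) - F_n(x-1)) - (F(x) - F(x-1))\big), \\
n^{-1/2}W_n(h) &= (F(x+h) - F(x-1)) - p_x(h+1)
\begin{cases} = 0 & \text{for } h = -1, 0, \\ < 0 & \text{otherwise.} \end{cases}
\end{aligned}
\]

The term $V_n(h)$ enters with a minus sign because $n^{1/2}\hat p_{n,x}(h+1) = V_n(h) + n^{1/2}p_x(h+1)$.

Let $\mathbb{U}$ denote the standard Brownian bridge on $[0, 1]$. It is well-known that $U_n(h) \Rightarrow \mathbb{U}(F(x+h)) - \mathbb{U}(F(x-1))$ and $V_n(h) \Rightarrow (h+1)(\mathbb{U}(F(x)) - \mathbb{U}(F(x-1)))$. Also, $W_n(h) = 0$ at $h = -1, 0$ and $W_n(h) \to -\infty$ for $h \notin \{-1, 0\}$. Define

\[
Z(h) = \mathbb{U}(F(x+h)) - \mathbb{U}(F(x-1)) - (h+1)\big(\mathbb{U}(F(x)) - \mathbb{U}(F(x-1))\big),
\]

and notice that $Z(0) = Z(-1) = 0$. It follows that the limit of (6.11) is

\[
P\big(\mathrm{argmax}_{h \in \{-1, 0\}}\{Z(h) - \epsilon(h+1)\} > -1/2\big) = 0,
\]

since $\mathrm{argmax}_{h \in \{-1, 0\}}\{Z(h) - \epsilon(h+1)\} = -1$. A similar argument proves that

\[
\lim_{n \to \infty} P(Y^G_{n,x} - Y_{n,x} < -\epsilon) = 0,
\]

showing that $Y^G_{n,x} - Y_{n,x} = o_p(1)$ and completing the proof. □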

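Proof of Theorem 3.8. Let $\varphi$ denote an operator on sequences in $\ell_2$. Specifically, we take $\varphi = \mathrm{gren}$ or $\varphi = \mathrm{rear}$. Also, for a fixed mass function $p$ let $T_p = \{x \ge 0 : p_x - p_{x+1} > 0\} = \{\tau_i\}_{i \ge 1}$. Next, define $\varphi_p$ to be the local version of the $\varphi$ operator. That is, for each $i \ge 1$, $\varphi_p(q)_x = \varphi(q^{(\tau_i + 1, \tau_{i+1})})_x$ for all $\tau_i + 1 \le x \le \tau_{i+1}$.

Fix $\epsilon > 0$, and suppose that $q_n \to q$ in $\ell_2$. Then there exists a $K \in T_p$ and an $n_0$ such that $\sup_{n \ge n_0} \sum_{x > K} q_{n,x}^2 < \epsilon/6$. By Lemma 6.1, $\varphi_p$ is continuous on finite blocks, and therefore it is continuous on $\{0, \ldots, K\}$. Hence, there exists an $n_0'$ such that for all $n \ge n_0'$

\[
\sum_{x=0}^{K} (\varphi_p(q_n)_x - \varphi_p(q)_x)^2 \le \epsilon/3.
\]

Applying (6.7), we find that for all $n \ge \max\{n_0, n_0'\}$

\[
\begin{aligned}
\|\varphi_p(q_n) - \varphi_p(q)\|_2^2 &\le \sum_{x=0}^{K} (\varphi_p(q_n)_x - \varphi_p(q)_x)^2 + 2\sum_{x > K} \varphi_p(q_n)_x^2 + 2\sum_{x > K} \varphi_p(q)_x^2 \\
&\le \epsilon/3 + 2\sum_{x > K} q_{n,x}^2 + 2\sum_{x > K} q_x^2 \le \epsilon,
\end{aligned}
\]

which shows that $\varphi_p$ is continuous on $\ell_2$. Since $Y_n \Rightarrow Y$ in $\ell_2$, it follows, by the continuous mapping theorem, that $\varphi_p(Y_n) \Rightarrow \varphi_p(Y)$. However, both $Y_n^R$ and $Y_n^G$ are of the form $\sqrt{n}(\varphi(\hat p_n) - p) \ne \varphi_p(Y_n)$. To complete the proof of the theorem it is enough to show that

\[
\mathcal{E}_n = \|\sqrt{n}(\varphi(\hat p_n) - p) - \varphi_p(Y_n)\|_2^2
\]

converges to zero in $L_1$; that is, we will show that $E[\mathcal{E}_n] \to 0$.

By Skorokhod's theorem, there exists a probability triple and random processes $Y$ and $Y_n = \sqrt{n}(\hat p_n - p)$, such that $Y_n \to Y$ almost surely in $\ell_2$. Fix $\epsilon > 0$ and find $K \in T_p$ such that $\sum_{x > K} p_x < \epsilon/4$.

Next, let $T_p^K = \{0 \le x \le K : x \in T_p\}$, and let $\delta = \min_{x \in T_p^K}(p_x - p_{x+1})$. Then, there exists an $n_0$ such that for all $n \ge n_0$

\[
\begin{aligned}
\sup_{x \ge 0} |\hat p_{n,x} - p_x| &\le \delta/3, \qquad (6.12) \\
\sup_{x \ge 0} \Big|\sum_{0 \le y \le x} \varphi(\hat p_n)_y - F(x)\Big| &\le \delta/6, \qquad (6.13)
\end{aligned}
\]

almost surely (see Corollary 2.5).

Now, consider any $m \in T_p^K$. It follows that any such $m$ is also a touchpoint of the operator $\varphi$ on $\hat p_n$. Here, by touchpoint we mean that $\sum_{x=0}^{m} \varphi(\hat p_n)_x = \sum_{x=0}^{m} \hat p_{n,x}$. From (6.12), it follows that

\[
\inf_{x \le m} \hat p_{n,x} > \sup_{x > m} \hat p_{n,x},
\]

which implies that $m$ is a touchpoint for the rearrangement estimator. For the Grenander estimator, we require (6.13). Here,

\[
\begin{aligned}
\hat F_n^G(m) - \hat F_n^G(m-1) &\ge F(m) - F(m-1) - \delta/3 \\
&= p_m - \delta/3 \\
&> F(m+1) - F(m) + \delta/3 \\
&\ge \hat F_n^G(m+1) - \hat F_n^G(m).
\end{aligned}
\]

Therefore, the slope of $\hat F_n^G$ changes from $m$ to $m+1$, which implies that $m$ is a touchpoint almost surely. Let $\hat p_n^{(s,r)} = \{\hat p_{n,s}, \hat p_{n,s+1}, \ldots, \hat p_{n,r}\}$. An important property of the $\varphi$ operator is that if $m < m'$ are two touchpoints of $\varphi$ applied to $\hat p_n$, then for all $m + 1 \le x \le m'$, $\varphi(\hat p_n)_x = \varphi(\hat p_n^{(m+1,m')})_x$. Now, since $p$ takes constant values between the touchpoints $T_p^K$, it follows that $\sqrt{n}(\varphi(\hat p_n) - p)_x = \varphi_p(Y_n)_x$ for all $x \le K$.

Therefore, for all $n \ge n_0$

\[
\begin{aligned}
\mathcal{E}_n &= \sum_{x \ge 0} \big|\sqrt{n}(\varphi(\hat p_n) - p)_x - \varphi_p(Y_n)_x\big|^2 \\
&\le \sum_{x=0}^{K} \big\{\sqrt{n}(\varphi(\hat p_n) - p)_x - \varphi_p(Y_n)_x\big\}^2 + 2\sum_{x > K} \big\{\sqrt{n}(\varphi(\hat p_n) - p)_x\big\}^2 + 2\sum_{x > K} \big\{\varphi_p(Y_n)_x\big\}^2 \\
&\le 4\sum_{x > K} (Y_{n,x})^2,
\end{aligned}
\]

almost surely. It follows that

\[
\limsup_{n} \mathcal{E}_n \le 4\sum_{x > K} (Y_x)^2,
\]

and hence

\[
E\big[\limsup_{n} \mathcal{E}_n\big] \le 4E\Big[\sum_{x > K} Y_x^2\Big] = 4\sum_{x > K} p_x(1 - p_x) \le \epsilon.
\]

Since $\mathcal{E}_n \le 4\|Y_n\|_2^2$, with $E[\|Y_n\|_2^2] \le 1$, we may apply Fatou's lemma so that

\[
0 \le \limsup_{n} E[\mathcal{E}_n] \le E\big[\limsup_{n} \mathcal{E}_n\big] \le \epsilon.
\]

Letting $\epsilon \to 0$ completes the proof. □

Corollaries 3.9 and 3.10 are obvious consequences of Theorem 3.8. Remark 3.11 is proved in the following section.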



6.3 Limiting distributions for metrics: proofs 



Proof of Corollary 4.1. We provide the details only in the $k = 2$ setting. The cases when $k > 2$ follow in a similar manner, since here $\|x\|_k \le \|x\|_2$ for $x \in \ell_2$.

Convergence of $\|Y_n\|_2$, $\|Y_n^R\|_2$ and $\|Y_n^G\|_2$ follows from Theorems 3.1 and 3.8 by the continuous mapping theorem. That $\|Y^R\|_2 = \|Y\|_2$ is obvious from the definition of $Y^R$. That $\|Y^G\|_2 \le \|Y\|_2$ follows from Jensen's inequality and the definition of the $\mathrm{gren}(\cdot)$ operator, since for any $r \le s$, $\mathrm{gren}(Y^{(r,s)})_x$ is equal to the average of $Y_y$ over some subset of $\{r, \ldots, s\}$ containing the point $x$. If $p$ is not strictly decreasing, then there exists a region, which we denote again by $\{r, \ldots, s\}$, where it is constant. Then there is positive probability that $\mathrm{gren}(Y^{(r,s)})$ is different from $Y^{(r,s)}$. In this case, we have that

\[
\sum_{x=r}^{s} \mathrm{gren}(Y^{(r,s)})_x^2 < \sum_{x=r}^{s} Y_x^2,
\]

which finishes the proof of the stochastic ordering in the third statement. Convergence in expectation is immediate since

\[
E[\|Y_n\|_2^2] = \sum_{x \ge 0} p_x(1 - p_x) = E[\|Y\|_2^2],
\]

and the same results for $Y_n^R$, $Y_n^G$ follow by the dominated convergence theorem and the bounds in Theorem 2.1 (i). Lastly, the bound $E[\|Y^G\|_2^2] \le E[\|Y\|_2^2]$, with equality if and only if $p$ is strictly monotone, follows from the stochastic ordering. □



Proof of Corollary 4.2. The result of the corollary for the empirical estimator is essentially the Borisov-Durst theorem (see e.g. Dudley (1999), Theorem 7.3.1, page 244), which states that

\[
\sup_{C \in 2^{\mathbb{N}}} \Big|\sum_{x \in C} Y_{n,x}\Big| \Rightarrow \sup_{C \in 2^{\mathbb{N}}} \Big|\sum_{x \in C} Y_x\Big|
\]

if $\sum_x \sqrt{p_x} < \infty$. To complete the argument note that $\sup_{C \in 2^{\mathbb{N}}} |\sum_{x \in C} w_x| = \frac{1}{2}\sum_x |w_x|$ for any sequence $w$ such that $\sum_x w_x = 0$ (note that the condition $\sum_x \sqrt{p_x} < \infty$ means that the sequences $Y_n$ and $Y$ are absolutely summable almost surely). However, the result may also be proved by noting that the sequence $Y_n$ is tight in $\ell_1$ using Lemma 6.2, since

\[
\sum_{x \ge m} E[|Y_{n,x}|] \le \sum_{x \ge m} \sqrt{p_x(1 - p_x)} \to 0,
\]

as $m \to \infty$ under the assumption $\sum_{x \ge 0} \sqrt{p_x} < \infty$. The proof that $Y_n^R \Rightarrow Y^R$ and $Y_n^G \Rightarrow Y^G$ in $\ell_1$ is identical to the proof of Theorem 3.8 and we omit the details. Convergence of expectations follows since $\|Y_n\|_1$ is uniformly integrable, as

\[
E\big[\|Y_n\|_1 I\{\|Y_n\|_1 > a\}\big] \le \sqrt{E[\|Y_n\|_1^2]\, P(\|Y_n\|_1 > a)} \le \frac{E[\|Y_n\|_1^2]}{a}
\]

by the Cauchy-Schwarz inequality. All other details follow as in the proof of Corollary 4.1. □
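Proof of Corollary 4.3. If $\kappa < \infty$, then we have that

\[
nH^2(\hat p_n, p) = \frac{n}{2}\sum_{x=0}^{\kappa} \big(\sqrt{\hat p_{n,x}} - \sqrt{p_x}\big)^2 = \frac{1}{2}\sum_{x=0}^{\kappa} \frac{Y_{n,x}^2}{\big(\sqrt{\hat p_{n,x}} + \sqrt{p_x}\big)^2},
\]

which converges to

\[
\frac{1}{2}\sum_{x=0}^{\kappa} \frac{Y_x^2}{(2\sqrt{p_x})^2} = \frac{1}{8}\sum_{x=0}^{\kappa} \frac{Y_x^2}{p_x} \qquad (6.14)
\]

by Theorem 3.1 and Theorem 2.4 for $k = \infty$. That $\sum_{x=0}^{\kappa} Y_x^2/p_x$ has a chi-squared distribution with $\kappa$ degrees of freedom is standard, and is shown, for example, in Ferguson (1996), Theorem 9. Convergence of means follows by the dominated convergence theorem from the bound $H(p, q) \le \sqrt{\|p - q\|_1}$ (see e.g. Le Cam (1969), page 35) and Corollary 4.2. All other details follow as in the proof of Corollary 4.1. □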

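Proof of Remark 4.4. Suppose first that $\sum_{x \ge 0} \sqrt{p_x} = \infty$. Define $P$ to be the probability measure $P(A) = \sum_{x \in A} p_x$, and let $W$ be the mean zero Gaussian field on $\ell_2$ such that $E[W_x W_{x'}] = p_x \delta_{x,x'}$. Then we may write $Y =_d \{W_x - p_x W_{\mathbb{N}}\}_{x \ge 0}$, where $W_{\mathbb{N}} = \sum_{x \ge 0} W_x$. Now, since $\sum_{x \ge 0} P(|W_x| > \sqrt{p_x}) = \infty$, by the Borel-Cantelli lemma we have that $\sum_{x \ge 0} |W_x| = \infty$ almost surely. Since

\[
\sum_{x \ge 0} |Y_x| \ge \sum_{x \ge 0} |W_x| - |W_{\mathbb{N}}| \sum_{x \ge 0} p_x = \sum_{x \ge 0} |W_x| - |W_{\mathbb{N}}|
\]

and $W_{\mathbb{N}}$ is finite almost surely, it follows that $\sum_{x \ge 0} |Y_x| = \infty$ almost surely as well. That is, if $\sum_{x \ge 0} \sqrt{p_x} = \infty$, then the random variable $\|Y\|_1$ simply does not exist.

A similar argument works for the Hellinger norm. Assume that $\kappa = \infty$. Then

\[
\sum_{x \ge 0} \frac{Y_x^2}{p_x} = \sum_{x \ge 0} \frac{W_x^2}{p_x} - \Big(\sum_{x \ge 0} W_x\Big)^2,
\]

and the Borel-Cantelli lemma shows that $\sum_{x \ge 0} W_x^2/p_x$ is infinite almost surely. □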

Lemma 6.3. Let $Z_1, \ldots, Z_k$ be i.i.d. $N(0,1)$ random variables, and let $Z_1^G, \ldots, Z_k^G$ denote the left slopes of the least concave majorant of the graph of the cumulative sums $\sum_{i=1}^{j} Z_i$ with $j = 0, \ldots, k$. Let $T$ denote the number of times that the LCM touches the cumulative sums (excluding the point zero, but including the point $k$). Then

\[
E\Big[\sum_{i=1}^{k} (Z_i^G)^2\Big] = E[T].
\]
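Before the proof, the identity can be checked by simulation. The following Monte Carlo sketch is an illustration only, with a hypothetical helper `lcm_blocks` that pools adjacent violators, and with $k = 4$ and the number of replications chosen arbitrarily. It compares both sides of the lemma with the harmonic sum $\sum_{i=1}^{k} 1/i$, which is the value of $E[T]$ used in the proof of Proposition 4.5.

```python
import numpy as np

def lcm_blocks(z):
    """Pool adjacent violators on z. Each returned (sum, size) pair is one linear
    segment of the least concave majorant of the cumulative sums of z."""
    blocks = []
    for v in z:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return blocks

rng = np.random.default_rng(2)
k, reps = 4, 20000
total_sq, total_touch = 0.0, 0
for _ in range(reps):
    blocks = lcm_blocks(rng.standard_normal(k))
    # On a segment of size c with sum s, every slope equals s / c,
    # so the segment contributes c * (s / c)^2 = s^2 / c.
    total_sq += sum(s * s / c for s, c in blocks)
    # T counts the touchpoints after zero, i.e. the number of segments.
    total_touch += len(blocks)

mc_sq, mc_T = total_sq / reps, total_touch / reps
harmonic = sum(1.0 / i for i in range(1, k + 1))
```

With these settings `mc_sq` and `mc_T` should both lie close to `harmonic`, up to Monte Carlo error.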



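Proof. It is instructive to first consider some of the simple cases. When $k = 1$, the result is obvious. Suppose then that $k = 2$. We have

\[
\begin{array}{c|c|c}
T & \sum_{i=1}^{2} (Z_i^G)^2 & \text{if} \\ \hline
2 & Z_1^2 + Z_2^2 & Z_1 > Z_2 \\
1 & \big(\frac{Z_1 + Z_2}{\sqrt{2}}\big)^2 & Z_1 < \frac{Z_1 + Z_2}{2}
\end{array}
\]

Note that we ignore all equalities, since these occur with probability zero. It follows that

\[
E\Big[\sum_{i=1}^{2} (Z_i^G)^2\Big] = E\big[(Z_1^2 + Z_2^2) 1_{Z_1 > Z_2}\big] + E\Big[\Big(\frac{Z_1 + Z_2}{\sqrt{2}}\Big)^2 1_{Z_1 < \frac{Z_1 + Z_2}{2}}\Big],
\]

where, by exchangeability it follows that

\[
\begin{aligned}
E\big[(Z_1^2 + Z_2^2) 1_{Z_1 > Z_2}\big] &= E\big[(Z_1^2 + Z_2^2) 1_{Z_1 < Z_2}\big] \\
&= E\big[(Z_1^2 + Z_2^2)\big] P(Z_1 > Z_2) \\
&= 2P(T = 2).
\end{aligned}
\]

On the other hand, we also have that

\[
\begin{aligned}
E\Big[\Big(\frac{Z_1 + Z_2}{\sqrt{2}}\Big)^2 1_{Z_1 < \frac{Z_1 + Z_2}{2}}\Big] &= E\Big[\Big(\frac{Z_1 + Z_2}{\sqrt{2}}\Big)^2\Big] P\Big(Z_1 < \frac{Z_1 + Z_2}{2}\Big) \\
&= 1 \cdot P(T = 1),
\end{aligned}
\]

since the random variables $\bar Z = (Z_1 + Z_2)/2$ and $Z_1 - \bar Z$ are independent. The result follows.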
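Next, suppose that $k = 3$. Then we have the following, writing $\bar Z = (Z_1 + Z_2 + Z_3)/3$.

\[
\begin{array}{c|c|c|c}
T & \sum_{i=1}^{3} (Z_i^G)^2 & \text{(a)} & \text{(b)} \\ \hline
3 & Z_1^2 + Z_2^2 + Z_3^2 & Z_1 > Z_2 > Z_3 & \\
2 & Z_1^2 + \big(\frac{Z_2 + Z_3}{\sqrt{2}}\big)^2 & Z_1 > \frac{Z_2 + Z_3}{2} & \frac{Z_2 + Z_3}{2} > Z_2 \\
2 & \big(\frac{Z_1 + Z_2}{\sqrt{2}}\big)^2 + Z_3^2 & \frac{Z_1 + Z_2}{2} > Z_3 & \frac{Z_1 + Z_2}{2} > Z_1 \\
1 & \big(\frac{Z_1 + Z_2 + Z_3}{\sqrt{3}}\big)^2 & & \bar Z > Z_1,\ \bar Z > \frac{Z_1 + Z_2}{2}
\end{array}
\]

The choice of splitting the conditions between columns (a) and (b) is key to our argument. Note that the LCM creates a partition of the space $\{1, \ldots, k\}$, where within each subset the slope of the LCM is constant. The number of partitions is equal to $T$. Here, column (a) describes the necessary conditions on the order of the slopes on the partitions, while column (b) describes the necessary conditions that must hold within each partition.

In the first row of the table, we find by permuting across all orderings of (123) that

\[
\begin{aligned}
E\big[(Z_1^2 + Z_2^2 + Z_3^2) 1_{Z_1 > Z_2 > Z_3}\big] &= E\big[(Z_1^2 + Z_2^2 + Z_3^2)\big] P(Z_1 > Z_2 > Z_3) \\
&= 3P(T = 3).
\end{aligned}
\]

Next consider $T = 2$. Here, by permuting (123) to (312), we find that

\[
E\Big[\Big\{Z_1^2 + \Big(\frac{Z_2 + Z_3}{\sqrt{2}}\Big)^2\Big\} 1_{Z_1 > \frac{Z_2 + Z_3}{2}} 1_{\frac{Z_2 + Z_3}{2} > Z_2}\Big] = E\Big[\Big\{\Big(\frac{Z_1 + Z_2}{\sqrt{2}}\Big)^2 + Z_3^2\Big\} 1_{Z_3 > \frac{Z_1 + Z_2}{2}} 1_{\frac{Z_1 + Z_2}{2} > Z_1}\Big].
\]

Note that the permutation (123) to (312) may be re-written as ({12}{3}) to ({3}{12}), which is really a permutation on the partitions formed by the LCM. Now,

\[
\begin{aligned}
E\Big[\sum_{i=1}^{3} (Z_i^G)^2 1_{T=2}\Big]
&= E\Big[\Big\{\Big(\frac{Z_1 + Z_2}{\sqrt{2}}\Big)^2 + Z_3^2\Big\} 1_{\frac{Z_1 + Z_2}{2} > Z_3} 1_{\frac{Z_1 + Z_2}{2} > Z_1}\Big] \\
&\quad + E\Big[\Big\{Z_1^2 + \Big(\frac{Z_2 + Z_3}{\sqrt{2}}\Big)^2\Big\} 1_{Z_1 > \frac{Z_2 + Z_3}{2}} 1_{\frac{Z_2 + Z_3}{2} > Z_2}\Big] \\
&= E\Big[\Big\{\Big(\frac{Z_1 + Z_2}{\sqrt{2}}\Big)^2 + Z_3^2\Big\} 1_{\frac{Z_1 + Z_2}{2} > Z_1}\Big] \\
&= E\Big[\Big(\frac{Z_1 + Z_2}{\sqrt{2}}\Big)^2 + Z_3^2\Big] P\Big(\frac{Z_1 + Z_2}{2} > Z_1\Big) \\
&= 2P(T = 2),
\end{aligned}
\]

where in the penultimate line we use the fact that $Z_3$, $(Z_1 + Z_2)/2$ and $Z_1 - (Z_1 + Z_2)/2$ are independent.

Lastly,

\[
\begin{aligned}
E\Big[\Big(\frac{Z_1 + Z_2 + Z_3}{\sqrt{3}}\Big)^2 1_{\bar Z > Z_1} 1_{\bar Z > \frac{Z_1 + Z_2}{2}}\Big]
&= E\Big[\Big(\frac{Z_1 + Z_2 + Z_3}{\sqrt{3}}\Big)^2\Big] E\Big[1_{\bar Z > Z_1} 1_{\bar Z > \frac{Z_1 + Z_2}{2}}\Big] \\
&= 1 \cdot P(T = 1),
\end{aligned}
\]

as the variables $\bar Z = (Z_1 + Z_2 + Z_3)/3$ and $\{Z_1 - \bar Z, Z_2 - \bar Z, Z_3 - \bar Z\}$ are independent.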
The key to the general proof is the combination of two actions:

1. Permutations of subgroups (column (a)), and

2. independence of column (b) from the random variables and the indicator functions in column (a). Note that for any $k > j \ge 1$, letting $\bar Z = (Z_1 + Z_2 + \cdots + Z_k)/k$,

\[
\frac{Z_1 + Z_2 + \cdots + Z_j}{j} - \bar Z = \frac{(Z_1 - \bar Z) + (Z_2 - \bar Z) + \cdots + (Z_j - \bar Z)}{j},
\]

which is independent of $\bar Z$ for any choice of $j < k$.

To write down the proof for any $k$ we must first introduce some notation.

• For any $1 \le m \le k$, we may create a collection $V$ of partitions of $\{1, \ldots, k\}$ such that the total number of elements in each partition is $m$. For example, when $k = 4$ and $m = 2$, then the elements of $V$ are the partitions ({1}{234}), ({12}{34}) and ({123}{4}). Furthermore, for each partition, we may write down the number of elements in each subset of the partition. Here the sizes of the partitions are 1,3 then 2,2 and 3,1. These partitions may be grouped further by placing together all partitions such that their sizes are unique up to order. Thus, in the above example we would put together 1,3 and 3,1 as one group, and the second group would be made up of 2,2. From each subgroup we wish to choose a representative member, and the collection of these representatives will be denoted as $\mathcal{T}(m)$. We assume that the representative $\tau$ is chosen in such a way that the sizes of the partitions are given in increasing order. Let $r_1$ denote the number of subgroups with size 1, and so on. Thus, for $\tau = (\{1\}\{234\})$, we have $r_1 = 1, r_2 = 0, r_3 = 1, \ldots, r_k = 0$.

• Next, from $\mathcal{T}(m)$ we wish to re-create the entire collection $V$. To do this, it is sufficient to take each $\tau$ and re-create all of the partitions which had the same sizes. Let $\sigma_m\tau$ denote the resulting collection for a fixed partition $\tau$. Thus, $V$ is equal to the union of $\sigma_m\tau$ over all $\tau \in \mathcal{T}(m)$. Note that the number of elements in $\sigma_m\tau$ is given by the multinomial coefficient

\[
\binom{m}{r_1\ r_2\ \cdots\ r_k}.
\]

We also use the notation $R_j = \sum_{i=1}^{j} r_i$, with $R_0 = 0$. Note that $R_k = m$.

• For each partition $\sigma$, we write $\sigma_1, \ldots, \sigma_m$ to denote the individual subsets of the partition. Thus, for $\sigma = (\{1\}\{234\})$, we would have $\sigma_1 = \{1\}$ and $\sigma_2 = \{2, 3, 4\}$.

• For each $\sigma_j$ as defined above, we let

\[
AV_{\sigma_j}Z = \Big(\sum_{i \in \sigma_j} Z_i\Big)\Big/|\sigma_j| \quad \text{and} \quad AV_{\sigma_j^{-l}}Z = \Big(\sum_{i \in \sigma_j^{-l}} Z_i\Big)\Big/|\sigma_j^{-l}|,
\]

where $\sigma_j^{-l}$ denotes $\sigma_j$ with its last $l$ elements removed.



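We are now ready to calculate $E[\sum_{i=1}^{k} (Z_i^G)^2 1_{T=m}]$. By considering all possible partitions, this is equal to the sum over all $\tau \in \mathcal{T}(m)$ of the following terms

\[
\sum_{\sigma \in \sigma_m\tau} E\Big[\Big\{\sum_{j=1}^{m} |\sigma_j| (AV_{\sigma_j}Z)^2\Big\}\, 1_{AV_{\sigma_1}Z > \cdots > AV_{\sigma_m}Z} \prod_{j=1}^{m} 1_{AV_{\sigma_j}Z > \max\{AV_{\sigma_j^{-1}}Z, \ldots, AV_{\sigma_j^{-(|\sigma_j|-1)}}Z\}}\Big].
\]

By permuting each $\sigma \in \sigma_m\tau$, and appealing to the exchangeability of the $Z_i$'s, this is equal to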



E 



X 



E 



AVajZ>ma^{AVa^^ Z,...,AVa.'^^'''^ Z} 



^i<.,i(Av„,z)n n 



. i=l 



xE 



,i=i 



by independence of each $AV_{\sigma_j}Z$ and each $Z_i - AV_{\sigma_j}Z$ for $i \in \sigma_j$. Notice that the permutations of $\sigma \in \sigma_m\tau$ do not account for permutations across all groups with equal "size". By considering furthermore all permutations between groups of equal size, we further obtain that the last display above is equal to



E 



.i=i 



E 



m E 



xE 

k 



n AVrr ■ Z>m3.x{AVa^ Z,...,AVa ^^"^^ Z} 



E 



Lastly, we collect terms to find that E[^,-^^{Zf^yiT=m\ is equal to m times 



i=l 



AV^R-_ ^^Z>...>AVa^ Z 



E 



0=1 



= E E^ 

= P{T = m), 
which concludes the proof. 



^AV^^Z>...>AV^^Z 



□ 



Proof of Proposition 4.5. In light of Proposition 3.4 and the definition of $Y^G$ (along with some simple calculations), it is sufficient to prove that

\[
(s - r + 1)\, E\Big[\sum_{x=r}^{s} \mathrm{gren}(Y^{(r,s)})_x^2\Big] = \sum_{i=1}^{s-r} \frac{1}{i+1} \qquad (6.15)
\]

using the notation of Proposition 3.4. Without loss of generality we may assume that $r = 0$, and for simplicity we write $Y$ for $Y^{(r,s)}$.

Let $k = s + 1$, and let $Z_1, \ldots, Z_k$ denote $k$ i.i.d. $N(0,1)$ random variables, let $\bar Z$ denote their average, and let $\tilde Z_i = Z_i - \bar Z$ (which is independent of $\bar Z$). We then have that

\[
\begin{aligned}
E\Big[\sum_{x=1}^{k} \mathrm{gren}(Z)_x^2\Big] &= E\Big[\sum_{x=1}^{k} \mathrm{gren}(\tilde Z + \bar Z)_x^2\Big] \\
&= E\Big[\sum_{x=1}^{k} |\mathrm{gren}(\tilde Z)_x + \bar Z|^2\Big] \\
&= E\Big[\sum_{x=1}^{k} |\mathrm{gren}(\tilde Z)_x|^2\Big] + kE[\bar Z^2] \\
&= E\Big[\sum_{x=1}^{k} |\mathrm{gren}(\tilde Z)_x|^2\Big] + 1 \\
&= (s+1)\, E\Big[\sum_{x=0}^{s} |\mathrm{gren}(Y)_x|^2\Big] + 1.
\end{aligned}
\]

Therefore, by Lemma 6.3, to prove (6.15), it is sufficient to show that

\[
E\Big[\sum_{x=1}^{k} \mathrm{gren}(Z)_x^2\Big] = E[T] = \sum_{i=1}^{k} \frac{1}{i},
\]

where $T$ denotes the number of touchpoints of the LCM with the cumulative sums of the $Z_i$'s.

To do this, we use the results of Sparre Andersen (1954). He considers exchangeable random variables $X_1, X_2, \ldots$ and their partial sums $S_0 = 0, S_1, S_2, \ldots$, $S_n = \sum_{i=1}^{n} X_i$, and shows that the number $H_n$ of values $i \in \{1, \ldots, n-1\}$ for which $S_i$ coincides with the least concave majorant (equivalently the greatest convex minorant) of the sequence $S_0, \ldots, S_n$ has mean given by

\[
E[H_n] = \sum_{i=2}^{n} \frac{1}{i},
\]

as long as the random variables $X_1, \ldots, X_n$ are symmetrically dependent and

\[
P(S_i/i = S_j/j) = 0, \qquad 1 \le i < j \le n.
\]

The vector $X_1, \ldots, X_n$ is symmetrically dependent if its joint cumulative distribution function $P(X_i \le x_i, i = 1, \ldots, n)$ is a symmetric function of $x_1, \ldots, x_n$. This result is Theorem 5 in Sparre Andersen (1954). Clearly, we have that $E[T - 1] = E[H_k]$, for $X_1 = Z_1, \ldots, X_k = Z_k$, which are exchangeable, and satisfy the required conditions. The result follows. □

Proof of Remark 3.11. To prove this result we continue with the notation of the previous proof. Equality of $\mathrm{gren}(Y)$ with $Y$ holds if and only if every point of $\{0, \ldots, y\}$ is a touchpoint, that is, if and only if $T = y + 1$. By Theorem 5 of Sparre Andersen (1954), this occurs with probability $1/(y+1)!$. □

Proof of Remark 4.6. By Proposition 3.4 (and using the notation defined there), it is enough to prove that

\[
E[\mathrm{gren}(Y)_x^2] \le \frac{1}{\tau} - \frac{1}{\tau^2},
\]

where for simplicity we write $Y = Y^{(r,s)}$. Let $\{Z_x\}_{x=r}^{s}$ be i.i.d. normal random variables with mean zero and variance $1/\tau$, and let $\bar Z = (\sum_{x=r}^{s} Z_x)/\tau$. Then $Y =_d Z - \bar Z$, and also $\mathrm{gren}(Z)_x = \mathrm{gren}(Z - \bar Z)_x + \bar Z$. Notice also that $Z - \bar Z$ and $\bar Z$ are independent. We therefore find that

\[
\begin{aligned}
E[\mathrm{gren}(Y)_x^2] + 1/\tau^2 &= E[\mathrm{gren}(Z)_x^2] \\
&\le E[Z_x^2] = 1/\tau,
\end{aligned}
\]

the latter inequality following directly from Theorem 1.6.2 of Robertson et al. (1988), since the elements of $Z$ are independent. □



6.4 Estimating the mixing distribution: proofs 

Proof of Theorem 5.1. Since $\|\tilde q_n - q\|_k \le \|\tilde q_n - q\|_1$ and $H(\tilde q_n, q) \le \sqrt{\|\tilde q_n - q\|_1}$, it is sufficient to only consider convergence in the $\ell_1$ norm. Note that

\[
|\tilde q_{n,x} - q_x| \le (x+1)\big\{|\tilde p_{n,x+1} - p_{x+1}| + |\tilde p_{n,x} - p_x|\big\},
\]

and therefore we may further reduce the problem to showing that $\sum_{x \ge 0} x|\tilde p_{n,x} - p_x|$ converges to zero.

For $\tilde p_n = \hat p_n$, we have that for any large $K$

\[
\sum_{x \ge 0} x|\hat p_{n,x} - p_x| \le K^2 \sup_{x} |\hat p_{n,x} - p_x| + \sum_{x > K} x p_x + \sum_{x > K} x \hat p_{n,x},
\]

and since $E_p[X]$ exists by assumption, it follows from the law of large numbers that for any $K$,

\[
\sum_{x > K} x \hat p_{n,x} \to \sum_{x > K} x p_x
\]

almost surely. The proof now proceeds as in the proof of Theorem 2.4.

For the rearrangement estimator and the MLE, we may use the same approach. The key is to note that $\sum_{x > K} x \tilde p_{n,x} \le \sum_{x > K} x \hat p_{n,x}$ for any $K$ and for both $\tilde p_n = \hat p_n^R, \hat p_n^G$. This holds since $f_x = x 1_{x > K}$ is an increasing function and therefore (6.6) of Lemma 6.1 applies. □

Proof of Theorem 5.2. Since $\kappa < \infty$ by assumption, the theorem follows directly from the results of Sections 3 and 4, as well as Theorem 5.1. □